options(repos = c(CRAN = "https://cran.rstudio.com"))
install.packages("kableExtra")
## 
## The downloaded binary packages are in
##  /var/folders/bg/zt5bhk7j3d9_19gpq5b6j5d80000gn/T//RtmpMWMYRx/downloaded_packages
library(kableExtra)

1 Introduction

We will pick up where we left off in Demo 1 to prepare you for analysis. Although this is not going to be as exhaustive as we would like to do if we were doing a complete project, our goal is to see this project go from messy data \(\rightarrow\) prediction & interpretation. After all, we need to give something to the patient advocacy watchdog that they can deploy!

1.1 Our Analysis Plan

Our goal is to see this analysis through to generate a predictive model that the advocacy group can use, and hopefully it will be a decent one. To ensure that it stands the best chance of being a good predictive model, our analysis plan includes all the key steps in the data science life cycle/project flow, including: cleaning, exploration, feature selection, pre-processing, unsupervised machine learning exploration, and supervised modeling. Whew!

Our specific plan is to perform two big analyses in the hopes of having a nice deliverable to present to our client. The first will be a segmentation analysis, or clustering analysis, which will attempt to group our hospitals together on the basis of their attributes and their pneumonia-related readmissions. Thie will be done using a combination of PCA and \(k\)-means. Our second analysis will be to perform supervised, predictive analysis using an elastic net, which conveniently combines a penalized regression with variable selection so we can point our clients down a path of hospital attributes to focus on, as well as a model they can use to make future predictions.

1.2 Your objective.

Your oject is to work through this Project 1. You will have 31 Questions to answer as you work through this, just as with Project 1. Then, you will be able to choose your own adventure again to perform your own clinical informatics analysis:

Adventure 1.

[Minimal coding, focus is on understanding cleaning, pre-processing, and the ML analyses we performed here.] Choose ONE other condition (anything other than pneumonia) and run through the analyses again, using our same target variable. You will update your question, hypotheses, and predictions, but will be able to re-use all of the other code with very minor modifications. Credit will be based more on interpretation and not on coding. See more information on Adventure 1 here.

Adventure 2.

[More coding, but not as much, with a focus on the steps we’ve undertaken here and extending it to a new question & condition.] Choose a slightly more complicated analysis to undertake, for example, focusing on surgical interventions (HIP-KNEE & CABG) or heart-related conditions (HF, AMI, & CABG). Coding should still be fairly minimal, but you are likely to run into problems with the code working exactly as-is, especially during cleaning. The advantage of this analysis is that it will be much more robust, have larger sample sizes, and would allow us to be able to say something far more informative to a subset of patients (e.g., those considering undergoing surgical interventions). See more information on Adventure 2 here.

Adventure 3.

[Most coding, but no cleaning or pre-processing, with an emphasis entirely on the machine learning analyses.] Continue with the pneumonia-related data that you have here, completely cleaned and pre-processed, and do two more unsupervised OR supervised analyses. For example, you could do another kNN using the \(k\) we identify in our Segmentation Analysis here, or you could do a random forest to account for any possible non-linearity. You could even choose to do a Ridge and LASSO regression and compare those analyses to what we get out of an Elastic Net. Any of these are fine, but you must explain your decision in your submission. See more information on Adventure 3 here.

1.3 A Fork in the Road!

This is the first of your choices in your adventure, and what you choose here could either lead to success or… sudden death (dum, dum, dummmm…). Choose wisely, young data scientists!

Option 1.3.1

Load data from the end of Demo 1, which includes the ordinal and frequency encoding you did (plus a little bonus cleaning to help).

Load my version of the fully cleaned, encoded, & ready-to-proceed dataset from the end of Demo 1. These data are not yet ready for analysis (!) but we would be picking up with that here.

## Uncomment if this is your choice:
load("/Users/stephgarofalo/Documents/GitHub/Merrimack_DSE6630/I. Biomedical & Clinical Informatics/Project_1/FY2024_data_files/pneumoniaAnalyzeFullyEncoded2024.Rdata")
dat2Analyze <- pneumoniaAnalyzeFullyEncoded

Option 1.3.2

Load data from Demo 1 without ANY encoding, but all other tidying the same.

## Comment out if this is NOT your choice:
## Uncomment if this is your choice:
load("/Users/stephgarofalo/Documents/GitHub/Merrimack_DSE6630/I. Biomedical & Clinical Informatics/Project_1/FY2024_data_files/pneumoniaAnalyzeNoEncoding2024.Rdata")
##dat2Analyze <- pneumoniaAnalyzeNoEncoding

Question 1: [1 point]

Which dataset do you choose and why?

I chose this dataset because it has already been fully encoded and includes the ordinal and frequency encoding applied in Demo 1. This means it is ready for analysis with minimal additional preprocessing required, which will save time and ensure consistency with the work we did in Demo 1. Additionally, using this pre-encoded dataset minimizes the risk of introducing errors or inconsistencies in encoding. If I had chosen the unencoded dataset (Option 1.3.2), I would have had to repeat the encoding process manually, which could lead to inconsistencies or errors. Therefore, using the fully-encoded dataset ensures we build on the previous work efficiently and accurately.

1.4 Helper code: Using R like an object-oriented language

Many of you still may be early enough in your programming journeys that you haven’t yet learned how to fully leverage the objected-oriented nature of programming languages like Python. R is not technically an object-oriented language in the same way, but we can hack it to function like one. That’s exactly what we are going to do here by loading a helper function directly from the source code.

We went through all the trouble in the Demo to perform encoding and now we get to use it! You will use the source() function to load a helper function that I have written, using my answer to the ordinal encoding question of the Demo, so that you can do all your ordinal encoding in a single line of code. Neat, huh?

BUT WHY are we doing the encoding now if we just chose to bring in non-encoded data? Because we can do ordinal encoding before or after data partitioning with impunity. It does not introduce data leakage to do so here, so we can do it in either order. This means that we can do it now simply for convenience.

NOTE: If you chose Option 1, the fully-encoded dataset, you can still run this chunk because there is a conditional in here that will not run your dataset since it is already encoded! But still make sure you understand the idea of source code because we will keep using it throughout the course!

## This calls the code in the associated .R file
source(file = "doOrdinalEncoding.R")

## This checks to see if you loaded the already encoded dataset
if(!exists("pneumoniaAnalyzeFullyEncoded")) {
    ## This sets a list of PATTERNS to match for the columns to ordinal encode
    encodeList <- c("ComparedToNational_", 
                    "PaymentCategory", 
                    "ValueOfCareCategory",
                    "Score_Emergency department volume")

    ## This runs the function, which takes 3 arguments
    ## Open the source code to see what each argument does!
    dat2Analyze <- doOrdinalEncoding(df = dat2Analyze,
                                     encodeList = encodeList,
                                     quiet = FALSE)
}

2 Exploratory Data Analysis: Which of the possible target (outcome) variables should we use? (Continued)

NOTE: What follows is my (long-winded) answer to Question 19 from the Demo.

2.1 Overview of our options

I am going to quickly break down the differences between the possible targets so we can make sure we really understand what they are measuring to allow us to make our most-informed selection of a target. Note: More information was found in this PDF. This is really important for helping us choose the right one!

Notice that I am also including a variable we didn’t look at previously called ComparedToNational_Hospital return days for pneumonia patients. This is because this is our best assessment of how the hospital is performing with regard to pneumonia readmissions relative to a national average. However, even though I’m including it in the table, I think it’s better to think of it as a solid predictor. It’s telling us how length of stay (LOS) in the hospital after pneumonia readmission stacks up compared to the national average. After a hospital readmission, it’s generally better to have fewer return days rather than more because it suggests better quality of care, more timely follow-up, and a greater chance of the patient recovering well at home. Since these are readmissions, having more days than average could also suggest that the hospital missed the severity or didn’t properly plan for the patient’s recovery at home and released them too early.

You may be wondering - could we use it as a target? Sure - but it’s not for the question we were posed by our stakeholder! Note that, if we did, we would need to perform a multinomial classification analysis. So, instead, let’s use this category to help us contextualize our hospitals as we choose between Predicted, Observed, and Expected readmission rates, as well as Excess Readmission Ratio.

Table 1. List of possible target variables from the pneumonia-related hospital readmission data. Which one to choose?
Possible Target Description
ComparedToNational_Hospital return days for pneumonia patients Hospital return days measures the number of days patients spent back in the hospital (in the emergency department, under observation, or in an inpatient unit) within 30 days after they were first treated and released for pneumonia. Reported as compared to the national average, such that ‘below average’ is better and ‘above average’ is worse.
ExpectedReadmissionRate The expected number of readmissions in each hospital is estimated using its patient mix and an average hospital-specific intercept. It is thus indirectly standardized to other hospitals with similar case and patient mixes.
PredictedReadmissionRate The number of readmissions within 30 days predicted based on the hospital’s performance with its observed case mix. The predicted number of readmissions is estimated using a hospital-specific intercept, and is intended to reflect the annual expected performance of the hospital given its historical case and patient mix and performance.
ExcessReadmissionRatio The ratio of the predicted readmission rate to the expected readmission rate, based on an average hospital with similar patients. Performance is compared against a ratio of one, such that below one is better and above one is worse in terms of readmission rates.
observed_readmission_rate Our calculation of the observed number of pneumonia-related readmissions within 30 days found by dividing the number of readmissions observed for the hospital during the given period by the number of discharges for that same period, and multiplied by 100 to put it on the same scale as the Predicted and Expected Readmission Rates. It reflects a true observation as reported by the hospital during that period, but is not adjusted for case mixes or prior information for that hospital. Thus, this is a crude, unadjusted value.

2.2 Summary statistics of the possible targets

Now, I know I didn’t ask you to do it, but I will actually begin my assessment of which target is most appropriate by constructing a quick summary table. In my assessment of what is an appropriate target,

Table 2. Possible target measure summaries
Characteristic N = 4,8161
National Comparison Return Hospital Days
    Better than average 402 / 4,816 (8.3%)
    Same as average 2,285 / 4,816 (47%)
    Worse than average 827 / 4,816 (17%)
    Unknown 1,302 / 4,816 (27%)
Predicted Readmission Rate 16.46 (1.93), 16.27
    (Missing) 2,090
Observed Readmission Rate 17.1 (3.7), 16.9
    (Missing) 2,603
ExcessReadmissionRatio 1.00 (0.06), 1.00
    (Missing) 2,090
Expected Readmission Rate 16.43 (1.51), 16.39
    (Missing) 2,090
1 n / N (%); Mean (SD), Median

We can see right away that we lose nearly 600 more data points if we opt to use our calculated observed_readmission_ratio (NOTE: nearly 500 if using the 2025 FY data), which is going to be a by-product of the fact that the Predicted and Expected readmission ratios are using prior information and information about patient mixes to generate these values. Our observed_readmission_ratio is crude and unadjusted any way - which could present problems when it comes to true comparability. So, off the bat it feels as if the constructed target is possibly not a good choice despite its simplicity for stakeholders to understand.

2.3 Comparing the four possible metrics toComparedToNational_Hospital return days for pneumonia patients

Although ComparedToNational_Hospital return days for pneumonia patients is very compelling, it’s better as a predictor rather than a target because it gets at an aspect of hospital performance: once readmitted due to pneumonia, how long (as compared to the national average) are patients spending in the hospital? This means it makes a really nice predictor to use to compare our four possible choices for pneumonia-related hospital readmissions. We are looking for a clear separation in our ideal target between hospitals that have lower readmission rates AND also tend to do better than the national average at healing their readmitted patients!

So, this makes a really nice variable I could choose to facet plots, which means creating sub-plots that show up as panels. The thing is, if I know I want to facet plots, the first thing I have to do is reshape the data I want from wide into long format using, for example, the pivot_longer() function (the complement of pivot_wider() that we used in the Demo).

I don’t have to store it as a separate data frame (i.e., I can just pipe it directly into ggplot()!) but here I am choosing to so we can more easily make sure it’s reshaped correctly along the way. Thus, I am making a long-format version of the possible target variables called longTargets so that I can better decide which target to choose.

I mentioned violin and density plots specifically in Question 19 of Demo 1, as well as faceting. I am going to show you two ways to explore these data with these types of plots and you can decide which one(s) you find most informative here. You also may have thought of a better way to display the data too!

2.3.1 Example with geom_violin():

Here I am leaving the NA data (which I relabelled as Unknown) to get a sense for the degree of missingness.

Figure 1. Readmission Rates vs. readmitted hospital LOS

Figure 1. Readmission Rates vs. readmitted hospital LOS

Question 2A: [0.5 points]

If you have never encountered violin plots before, they are honestly a wonderful way to assess the symmetry and skew of a distribution while more easily enabling comparisons between groups. They’re effectively a hybrid between a histogram and a boxplot, and we can even opt to overlay the boxplot quartiles in the plots as well, which is particularly handy!

Your task is to do just that by adding geom_boxplot(width = 0.1) (yes, just that one line!) in the appropriate place. Also change include = FALSE to include = TRUE in the R chunk header so that it shows up when you knit!

Make sure to comment your line to call attention to it:

Figure 2. Readmission Rates vs. readmitted hospital LOS

Figure 2. Readmission Rates vs. readmitted hospital LOS

Question 2B: [0.5 points]

Now that you can compare these groups more easily, what we’re looking for is a balance between clear difference between hospital ordinality relative to the national average (i.e., “Are better hospitals really better in pneumonia readmission rates?”).

Which target(s) appear the most distinct to you? > The Predicted Readmission Rate target appears the most distinct, as it shows the clearest separation between hospital categories (“Better,” “Same,” “Worse”) in both the boxplots and violin plots. This suggests it is the best target for capturing differences in pneumonia readmission rates across hospitals.

2.3.2 Example with geom_density():

We could also choose to overlay density plots instead of violin or boxplots, as this can sometimes help us see subtler separation of groups than we can see otherwise. This time, I am also going to ignore the “Unknown” (i.e., missing) class of hospitals to prevent things from getting too messy!

Question 3: [0.5 points]

Does this view confirm or change your decision from Question 3?

This view confirms my decision to choose Predicted Readmission Rate as the target. It shows the clearest separation between hospital groups, while the Observed Readmission Rate has more overlap, especially between “Same” and “Worse” categories. Therefore, Predicted Readmission Rate is the most distinct and appropriate target for modeling pneumonia readmissions.

2.5 Final remarks on target selection.

Going forward, I have selected PredictedReadmissionRate as our target. In a more thorough analysis, we would likely want to consider:

  1. Circling back to our stakeholder to try to refine our question or target using domain knowledge

  2. Perform experiments with our models to find the best predictive model since that is our ultimate goal.

However, given that we do not have the option or time for that, let me explain my choice. It’s the best balance of distinguishing hospitals with better and worse readmitted hospital stays, while still providing an adjusted, hospital-specific measure. Adjusted values will generally always be a better choice than the crude, even if our crude measure captures something interesting about the data, as it did here, or if it’s easier for our stakholder to comprehend. Our crude measure, observed_readmission_rate did the best job of recapitulating the readmitted hospital LOS (ComparedToNational_Hospital return days for pneumonia patients), but if that’s really our focus, wouldn’t it then be better to switch from a readmission rate to that as our target? Probably.

If you still feel unsure about what is the best choice, you’re not alone - so do I! This is where, because we are working with an imagined stakeholder, we must make the best decision we can given the question we defined in the Demo. I would defend my choice of Predicted Readmission Rate for three key reasons:

  • It has been adjusted to be hospital-specific, using a hospital-specific intercept and historic data about the hospitals’ performances

  • We don’t lose nearly 600 data points to missingness.

  • Our imagined stakeholder, and the question we set, was about readmission rate so even though readmitted return hospital days might also be a similarly viable candidate, it doesn’t reflect our imagined stakeholder’s request. Note that the ExpectedReadmissionRate is standardized for similar hospitals, and may not truly reflect a given hospital; the ExcessReadmissionRatio, while informative, was the least likely to reflect the distinctiveness in hospital return days, which is still of interest to us.

But remember that what matters most is that we are matching our question.

Question 7: [1 point]

Do you agree with my choice of target? Why or why not? (You are free to disagree!)

I agree with your choice of Predicted Readmission Rate as the target. It provides a hospital-specific, adjusted measure that aligns with the stakeholder’s interest in readmission rates. While the Observed Readmission Rate showed stronger alignment with readmitted hospital LOS, it’s not adjusted, and as you noted, using that as a target could shift the focus of the analysis. The Predicted Readmission Rate also retains more data points, which is valuable for model robustness. I appreciate the balance of statistical rigor and stakeholder relevance in your choice, and I agree that matching the analysis to the question is key.

2.6 Bias and Limitation

Question 8: [2 points]

All studies have bias (introduced or systemic error) and limitations (failure to fully explain something). We could go more deeply into these concepts, but for now I want you to take a stab and just brainstorming either ways we might have bias OR the limitations of using this target variable. You do not need to answer both, but you’re welcome to do so. You can read more about bias here or see some deeper explanations of limitations here. You do not need to write a lot; just 2-3 sentences. I am simply asking you to pause and reflect on our choices here before we proceed.

One potential bias is modeling bias—the algorithm used to adjust for hospital differences may not fully capture all the factors influencing readmission rates, such as socioeconomic status, staffing, or regional differences. A key limitation is that this variable reflects a modeled estimate, not direct observations, which may oversimplify complex relationships between hospitals, patients, and outcomes.

3 Pre-processing and Feature Selection

The time has finally come to officially move back into pre-processing and feature selection so that we can finally build some models. I want you to take note of the order in which I’m doing things here; the flow can change from project to project, but generally it’s a good idea to get your dataset to a state of manually curated features before making your data partition.

3.1 Further feature selection and refinement.

3.1.1 Dropping the targets we aren’t planning to use. [Manual Method]

It’s time to say goodbye to the features we know would be far too redundant or non-informative given our target of PredictedReadmissionRate. We don’t need any of the Sample_ columns, for example; these are merely sample sizes. Useful if we were doing standardization, but not necessary for us right now. We also are dropping the other possible target features that are too redundant with PredictedReadmissionRate.

Question 9A: [0.5 points]

Add your comments to the code as indicated to explain what I was doing at each of the steps.

Your answer in the comments.

dat2Analyze <- dat2Analyze %>% 
## Drop columns we no longer need—this includes the other readmission target variables we aren't using, sample size columns, and number of surveys completed.
  select(-ExpectedReadmissionRate, 
         -ExcessReadmissionRatio,
         -observed_readmission_rate, 
         -contains(c("Sample_", "NumberOfPatients")),
         -NumberSurveysCompleted) %>% 
## Rename columns for clarity—make long and unwieldy column names shorter and easier to work with.
  rename(`Median time (minutes) patients spent in ED` = 
         `Score_Average (median) time patients spent in the emergency department before leaving from the visit A lower number of minutes is better`,
         `Composite patient safety` = `Score_CMS Medicare PSI 90: Patient safety and adverse events composite`,
         `ComparedToNational_Composite patient safety` = `ComparedToNational_CMS Medicare PSI 90: Patient safety and adverse events composite`) %>% 
 ## Reorder columns so that our target variable, PredictedReadmissionRate, comes first for convenience.
  select(PredictedReadmissionRate, everything())

## Clean up column names by removing unnecessary prefixes from the names.
colnames(dat2Analyze) <- gsub('Score_|HcahpsLinearMeanValue_|_Payment', '', colnames(dat2Analyze))

3.1.2 Dropping any features with near-zero variance. [Automated Method]

Next, we are going to take advantage of the nearZeroVar() function that is part of the caret package. We will drop any of the columns with near-zero variance. This is a way of automating (and thus standardizing) the process for choosing to drop non-informative columns due to zero or nearly zero variance.

Question 9B: [0.5 points]

If you are not familiar yet with the concept of zero or near-zero variance, do a little digging. Based on what you read, try to explain for your naive advocacy group stakeholders, why it’s important to remove the features with little variance.

If a column has zero or near-zero variance, it means that almost all the values in that column are the same (or very close to each other). This makes the feature unhelpful for predicting outcomes, because it doesn’t provide meaningful information to distinguish between different cases. Keeping these low-variance features in our model can actually confuse or slow down the algorithm, so it’s important to remove them. By dropping them, we help our models focus only on the features that can really help make accurate predictions.

Table 3. Columns dropped due to near-zero variance.
Variables Dropped
PaymentCategory for pneumonia patients
ComparedToNational_Postoperative respiratory failure rate
ComparedToNational_Perioperative pulmonary embolism or deep vein thrombosis rate

Take note of the number of features here.

Question 9C: [0.5 points]

You may or may not have noticed that I moved FacilityID to our rownames in the datasets you loaded today. Why did I choose to do that? Why do I want to make sure unique identifiers, like FacilityID, are out of the dataframe before I either (a) run nearZeroVar() or (b) proceed with the rest of the analysis?

We moved FacilityID to the row names because it is a unique identifier, not a feature that provides meaningful information for our analysis. If we left it in the dataframe, functions like nearZeroVar() might mistakenly keep it (since each value is unique), or it could affect the analysis by introducing noise that doesn’t relate to hospital performance. Unique IDs don’t help predict readmission rates; they just label each hospital. Removing them from the feature set ensures our model focuses on real predictors, not arbitrary labels.

Question 9D: [1 point]

Do some research on, or lean into your previous knowledge, about the curse of dimensionality or the big-P, little-N (n >> P) problem. Using whatever you find or remember, can you justify why we likely do NOT have a big-P, little-N (n >> P) problem here? Make sure to explain.

We likely do not have a big-P, little-N problem because our dataset has many more observations (hospitals) than features (predictor variables). A big-P, little-N problem occurs when we have a large number of predictors (P) and relatively few observations (N), which increases the risk of overfitting and makes the model struggle to generalize. In our case, the number of features is manageable compared to the total number of hospitals in the dataset, so we are not at high risk of overfitting due to too many variables.

3.2 Assess missingness & devise an imputation plan

We are going to deal with imputation in a rather quick and elegant way that requires minimal effort from us using a model-based approach with missForest. You may recall from lecture that I said we are choosing a model-based approached without uncertainty because our ultimate goal is to create a deployable model for our stakeholder.

Question 10: [1 point]

Write some code that will quickly tally the number of columns in the dataset that contain missing values. You will find colSums() and is.na() functions likely useful here. What percent of the features have at least some level of missingness?

## Count columns with missing values
missing_cols <- sum(colSums(is.na(dat2Analyze)) > 0)
missing_cols
## [1] 29
## Total number of columns
total_cols <- ncol(dat2Analyze)

## Calculate percent of features with missingness
percent_missing <- (missing_cols / total_cols) * 100
percent_missing
## [1] 96.66667

96.66%

Which column(s) are not missing any data?

State.

How many complete observations do we have in the dataset? You may find nrow() with complete.cases() or drop_na() useful!

# Count complete cases
sum(complete.cases(dat2Analyze))
## [1] 644

3.2.1 Use the md.pairs() function from the mice package to assess the extent of missing values before imputing.

When we are missing a lot of data, as we are here, it can sometimes be useful to predetermine whether the patterns of missing correlate with our target. Why? We care if missing values correlate with the target because this may indicate that the missingness itself carries information. Thus, ignoring it could lead to biased models or low predictive power!

In fact, missingness can itself become a predictor. If a missing value is associated with the outcome, it may carry predictive signal. For example, in our data, hospitals with high rates of missingness in Left before being seen highly associated with pneumonia readmission might indicate a worrying pattern. Perhaps hospitals are negligently failing to report how many of their patients leave without being seen because of larger, systemic issues that ALSO happen to impact overall quality and patient care. Thus, missingness could itself become a predictor!

To do, I will leverage the md.pairs() function from the mice package which calculates the pairwise correlations of missingness between variables and allows us to visualize those patterns using a clustered heatmap. When I do this, I will drop State from the matrix because it will create errors if I retain it. (Why?)

## Drop State and analyze the pairwise missingness using the mice package
miceMatrix <- dat2Analyze %>% select(-State) %>% md.pairs()

What md.mice() does is create two pairwise correlation matrices, mr and mm, where:

  • mr = Missing vs. Responded
    • Pairwise matrix showing how often the row has a missing value in one variable and an observed (i.e., non-missing or “responded”) value in the other
  • mm = Missing vs. Missing
    • Pairwise matrix showing how often the row has a missing value in one variable and missing value in the other variable

Let’s take a look at the first 6 rows and 4 columns of the mr matrix:

Table 4. Missing vs. Reponded Matrix Slice
PredictedReadmissionRate ComparedToNational_MRSA Bacteremia MRSA Bacteremia Payment for pneumonia patients
PredictedReadmissionRate 0 48 48 630
ComparedToNational_MRSA Bacteremia 1041 0 0 1650
MRSA Bacteremia 1041 0 0 1650
Payment for pneumonia patients 21 48 48 0
Abdomen CT Use of Contrast Material 37 42 42 69
ComparedToNational_Death rate for pneumonia patients 10 47 47 6

Any given cell in a matrix is always of the form \([i, j]\), where \(i\) is the row and \(j\) is the column - just as you are used to with dataframes! So, in these correlation matrices of our pairwise variables, cell \([1,1]\) is PredictedReadmissionRate vs. itself, so the count is 0 - as is every value on the diagonal, like with all correlation matrices. For cell \([1,2]\) PredictedReadmissionRate vs. , there are 42 observations where variable PredictedReadmissionRate (\(i\)) is missing but MRSA Bacteremia (\(j\)) was observed. Thus, we can extract the number of cases where variable \(i\) is missing and variable \(j\) is observed using miceMatrixmr[i, j].

Now, let’s take a look at the first 6 rows and 4 columns of the mm matrix now:

Table 5. Missing vs. Missing Matrix Slice
PredictedReadmissionRate ComparedToNational_MRSA Bacteremia MRSA Bacteremia Payment for pneumonia patients
PredictedReadmissionRate 2090 2042 2042 1460
ComparedToNational_MRSA Bacteremia 2042 3083 3083 1433
MRSA Bacteremia 2042 3083 3083 1433
Payment for pneumonia patients 1460 1433 1433 1481
Abdomen CT Use of Contrast Material 852 847 847 820
ComparedToNational_Death rate for pneumonia patients 1292 1255 1255 1296

Similarly, miceMatrix$mm[i, j] gives us the number of cases where both variables \(i\) and \(j\) were co-missing.

Question 11A: [0.5 points]

Explain what you think the number in cell \([6,1]\) means. Out of all the observations in the dataset, does this feel like a high degree of co-missingness?

The number in cell [6,1] (1,292) means there are 1,292 rows where both variables—PredictedReadmissionRate and ComparedToNational_Death rate for pneumonia patients—are missing. Given our dataset has over 4,800 rows, this indicates a moderate level of co-missingness that may impact our model’s performance and needs to be addressed.

Lastly, we can calculate a proportion of missingness vs. response between any two variables \(i\) and \(j\) such that:

\(Proportion_{i,j} = \frac{mr_{i,j}}{mr_{i,j} + mm_{i,j}}\)

In non-mathy speak, it’s the fraction of the times that variable \(j\) is observed among rows where \(i\) is missing. So, if you want to impute variable \(i\) - let’s say it’s MRSA Bacteremia - then you will be interested in the variables typically observed (present in the dataset) when \(i\) is missing. Those are the variables that will be able to help us predict, or fill in, \(i\).

  • If this ratio is close to 1, \(j\) is nearly always available when \(i\) is missing, so \(j\) might be useful for imputing \(i\).
  • If this ratio is close to 0, \(j\) is usually missing when \(i\) is missing, so it’s not helpful for imputing \(i\). You can also use this to tell you about patterns of missingness with regard to your target variable when you target is \(i\).
Figure 5. Correlation between Missing and Response

Figure 5. Correlation between Missing and Response

It may feel a little counterintuitive, but here darker colors indicate a higher correlation between missingness and response (proportions nearer to 1). In other words, these are the variables that are going to be more informative during imputation for a given variable.

But what if we wanted to know about patterns of missingness with regard to our target, PredictedReadmissionRate? Well, we could get that too!

Figure 6. Correlation between Missing and Missing for Target

Figure 6. Correlation between Missing and Missing for Target

Question 11B: [0.5 points]

What columns seem to be most correlated in terms of missingness to our target, PredictedReadmissionRate? Does this give you any cause for concern? Feel free to speculate as to why you might see this.

The most correlated columns in terms of missingness with PredictedReadmissionRate are MRSA Bacteremia, ComparedToNational_MRSA Bacteremia, Postoperative respiratory failure rate, and Medicare spending per patient. This suggests a potential bias in our data—hospitals with missing values here may also have higher readmission risks. We should consider if the missingness itself is informative or due to data quality issues.

3.3 Drop rows that are missing from the target variable.

We will not be able to use those rows at all moving forward, sadly. That is going to reduce our sample size pretty sizably, from \(N = 4,816\) to only \(N = 2,726\) (or from \(N = 4,775\) to only \(N = 2,731\) .

## [1] "Rows: 2726"  "Columns: 30"

Question 12: [0.5 points]

Thinking back to your answer to Question 12, do you feel any differently given these dimensions of the dataset?

Given the dataset now has 2,726 rows and 30 columns, I feel more confident that we do not have a big-P, little-N problem—meaning we have more observations (N) than features (P). This is a good sign for model stability. However, the drop from over 4,800 rows to 2,700 after filtering for non-missing targets indicates that missingness may still affect our model and should be considered in further analysis.

3.4 Do the splits!

At this stage, we are almost ready to impute - but not until we have done our data partitioning first! In fact, splitting our data into training and testing partitions first is really important to prevent data leakage.

Question 13: [1 point]

Explain to your stakeholder, in simpler language, (1) why we partition data into testing and training sets and (2) what data leakage is. You do not need to go into great detail, but it should explain enough that they understand the basic ideas.

We split the data into training and testing sets so we can build and train our model on one set of data (the training set) and then test how well it performs on data it has never seen before (the testing set). This helps us get a realistic picture of how the model would perform in the real world, on new data. Data leakage happens when information from the testing set accidentally influences the training process—kind of like giving the model the answers before the test. If this happens, our model might look great during testing, but it won’t actually generalize well in the real world. That’s why we have to keep the testing set separate and untouched until the very end.

You may recall that I mentioned that the optimal ratio is has been argued to be \(\sqrt{p} : 1\), where \(p\) is the number of parameters (which may or may not equal your predictors depending on your planned analysis). Note that you do not have to split only a single time; you could make different splits for different analyses, if needed. However, we are going to use a single split here for convenience.

Now, wouldn’t it be nice to have a handy function that could calculate the optimal split ratio for us? So, let’s write one! This is something you can use with ANY analysis going forward!

Question 14A: [1.5 points]

There are three parts to this question to get it to all come together correctly.

  1. Comment the function to explain what it does (I’ve defined the arguments for you)

  2. Run the function (fill in the blank)

  3. Fill in the missing piece in the data partitioning chunk below

calcSplitRatio <- function(p = NA, df) {
  ## @p  = the number of parameters (predictor variables). If not specified, it defaults to the number of predictors in the dataset.
  ## @df = the dataframe to use for analysis
  
  ## If p is not provided, set p to the number of columns in df minus 1 (for the target column)
  if(is.na(p)) {
    p <- ncol(df) - 1   ## Assuming the last column is the target, we subtract 1
  }
  
  ## Calculate the ideal number of observations to set aside for testing, based on the number of predictors
  test_N <- (1/sqrt(p)) * nrow(dat2Analyze)
  
  ## Calculate the proportion of the dataset to set aside for testing
  test_prop <- round(test_N / nrow(dat2Analyze), 2)
  
  ## Calculate the proportion for training by subtracting the test proportion from 1
  train_prop <- 1 - test_prop
  
  ## Print the recommended train:test split ratio
  print(paste0("The ideal split ratio is ", train_prop, ":", test_prop, " training:testing"))
  
  ## Return the training proportion (for use in createDataPartition())
  return(train_prop)
}
train_prop <- calcSplitRatio(df = dat2Analyze)
## [1] "The ideal split ratio is 0.81:0.19 training:testing"

Hint: One important note here is that if the number of parameters are not provided (which is true for the starter code I gave you to run it - it does NOT pass any parameters to that argument!), then by default our function is designed to take the number of columns of the dataframe and subtract one for the target variable. Neat, huh?

Now, uncomment this code and fill it out to run it.

ind <- createDataPartition(dat2Analyze$PredictedReadmissionRate,   ## Put the target here!
                           p = train_prop,                         ## Use the object you just made!
                           list = FALSE)

train <- dat2Analyze[ind, ]                                        ## Use the index created by createDataPartition()
test <- dat2Analyze[-c(ind), ]                                     ## Complementary index for test set

Question 14B: [0.5 points]

What split ratio did you get? Is it close to the canonical 80-20 split? Why do you think that happened?

Yes, the split ratio is close to the canonical 80-20 split, which is a common starting point for partitioning data into training and testing sets. This happened because the formula we used is designed to balance the number of predictors (p) with the total sample size (n), and in this case, the relationship between the number of predictors and the total observations resulted in a split near 80-20. This balance helps ensure that there are enough observations in both sets for effective modeling while avoiding overfitting.

Did you get stuck somewhere? If so, load the test and train data in case you struggled with Question 18 so you can proceed:

load(file = "FY2024_data_files/pneumoniaTrain.Rdata")
load(file = "FY2024_data_files/pneumoniaTest.Rdata")

3.5 Impute missing variables using missForest.

We are going to take advantage of the missForest() function from the missForest package. In a nutshell, what missForest does is it will fit either a regression- or classification-based random forest model and use the OOB (“out-of-bag”) results to predict and fill in NAs. The advantage is that this process is non-linear, done in just a few lines of code, and can handle mixed data-types (although you need to make sure that all your numerical types are of the same numerical type, and that all characters or factors are of the same type).

3.5.1 Running missForest

The way I have chosen to do this is to separate the data so that I can turn the encoded categories back into categories temporarily, then stitch them back together into a temporary dataframe so that I can run missForest. Notice that I temporarily drop the State and target features; I drop State to spare ourselves a little computational time and I drop PredictedReadmissionRate to prevent data leakage. Be patient!!. Since this is randomForest, it could take a minute or two to run. NOTE: If your machine fails to run this, there are data you can load. Some computers may struggle to run this.

# Remove encoded categorical variables for imputation, plus the target and state (which have no missing)
data_sans_encoded = train %>% 
  select(-PredictedReadmissionRate,
         -State,
         -`Emergency department volume`, 
         -contains("ComparedToNational"))

data_cat = train %>% 
  select(`Emergency department volume`, 
         contains("ComparedToNational")) %>% 
  mutate_all(as.factor)

# Stitch them back together:
temp <- cbind(data_sans_encoded, data_cat)

# Impute missing values using missForest
imputedTrain <- missForest(temp, 
                          variablewise = TRUE,
                          verbose = FALSE)

NOTE: Did you catch that I not only removed the target before imputation, but I dropped State too? This is because we know State is/will be frequency encoded (depending on our path), and it could create imputation issues as a result. This leaves me with 29 columns to perform imputation on.

3.5.2 Load this if it will not run for you

load(file = "FY2024_data_files/imputedTrain.Rdata")

3.5.3 Quick summary of the results

NOTE: If yours didn’t run, make sure to look at the table output in the knitted HTML I provided you. Also, change the include=FALSE in the chunk header if you were not able to get missForest to run.

Table 6. missForest OOB Error Rates for the imputed variables, training dataset
OOB Error Variable
MSE
0.28 MRSA Bacteremia
948452.73 Payment for pneumonia patients
17.67 Abdomen CT Use of Contrast Material
3.72 Death rate for pneumonia patients
6.81 Postoperative respiratory failure rate
0.45 Perioperative pulmonary embolism or deep vein thrombosis rate
0.01 Composite patient safety
0.00 Medicare spending per patient
291.21 Healthcare workers given influenza vaccination
5.08 Left before being seen
124.19 Venous Thromboembolism Prophylaxis
0.95 Nurse communication
1.81 Doctor communication
4.52 Staff responsiveness
4.36 Communication about medicines
4.69 Discharge information
1.25 Care transition
11.94 Cleanliness
14.26 Quietness
0.90 Overall hospital rating
2.37 Recommend hospital
0.00 SurveyResponseRate
PCF
0.09 PaymentCategory for pneumonia patients
0.46 Emergency department volume
0.13 ComparedToNational_MRSA Bacteremia
0.06 ComparedToNational_Death rate for pneumonia patients
0.04 ComparedToNational_Composite patient safety
0.36 ComparedToNational_Hospital return days for pneumonia patients

Question 15: [1 point]

What does MSE (Mean Squared Error) and PCF (Proportion of Falsely Classified) indicate here? You may find looking at the missForest Vignette helpful. Which variable(s) had the highest OOB error?

MSE and PCF show how much error or misclassification occurred when filling in missing data. The variable with the highest MSE was Payment for pneumonia patients, followed by Healthcare workers given influenza vaccination and Venous Thromboembolism Prophylaxis, meaning these were the hardest to predict accurately. The highest PCF (classification error) was for Emergency department volume, showing it had the most classification mistakes. These variables may have complex or missing patterns that made imputation challenging.

NOTE 1: If the OOB error (either MSE or PCF) from missForest is really high, it’s a red flag that the imputation for those variables is untrustworthy. This could be do to too much missingness, weak relationships among the variables making it hard to predict, or low quality data (e.g., noisy, sparse). In such cases, we want to DROP any variables with extremely high OOB error because they add noise. A general rule of thumb is that if the PCF > ~0.3 (30% of the imputed values are wrong) or if there is especially high MSE (inflation) relative to the variance of the variable, we should drop it. Let’s set those aside so we can drop them from test and train both downstream!

NOTE 2: Even though the PCF is a bit too high, I am choosing to retain ComparedToNational_Hospital return days for pneumonia patients because it is a primary predictor. This could be a poor choice, however! Every decision we make as data scientists can come with pitfalls.

## These look okay. Uncomment to see their variances.
train_df <- imputedTrain$ximp

train_df$State <- train$State
train_df$PredictedReadmissionRate <- train$PredictedReadmissionRate

numeric_cols <- sapply(train_df, is.numeric)
numeric_vars <- names(train_df)[numeric_cols]
variances <- sapply(train_df[, numeric_vars], var, na.rm = TRUE)

print(variances)
##                                               MRSA Bacteremia 
##                                                  2.785197e-01 
##                                Payment for pneumonia patients 
##                                                  3.131602e+06 
##                           Abdomen CT Use of Contrast Material 
##                                                  1.874940e+01 
##                             Death rate for pneumonia patients 
##                                                  6.333361e+00 
##                        Postoperative respiratory failure rate 
##                                                  8.238540e+00 
## Perioperative pulmonary embolism or deep vein thrombosis rate 
##                                                  5.319982e-01 
##                                      Composite patient safety 
##                                                  3.278698e-02 
##                                 Medicare spending per patient 
##                                                  5.766305e-03 
##                Healthcare workers given influenza vaccination 
##                                                  3.375239e+02 
##                                        Left before being seen 
##                                                  5.611618e+00 
##                            Venous Thromboembolism Prophylaxis 
##                                                  9.251090e+01 
##                                           Nurse communication 
##                                                  6.700260e+00 
##                                          Doctor communication 
##                                                  6.503217e+00 
##                                          Staff responsiveness 
##                                                  2.045795e+01 
##                                 Communication about medicines 
##                                                  1.789782e+01 
##                                         Discharge information 
##                                                  1.419928e+01 
##                                               Care transition 
##                                                  7.897291e+00 
##                                                   Cleanliness 
##                                                  1.834932e+01 
##                                                     Quietness 
##                                                  2.727363e+01 
##                                       Overall hospital rating 
##                                                  1.315431e+01 
##                                            Recommend hospital 
##                                                  2.266634e+01 
##                                            SurveyResponseRate 
##                                                  4.744972e-03 
##                        PaymentCategory for pneumonia patients 
##                                                  2.852180e-01 
##                                      PredictedReadmissionRate 
##                                                  3.799866e+00
cols2drop <- c("Payment for pneumonia patients", "Emergency department volume")

Question 16: [1 point]

Alternatively, we could use MICE (Multivariate Imputation with Chained Equations), which is a method that will model uncertainty along with the variables. Do a little research. How does MICE work? Include a citation to a source.

MICE (Multiple Imputation by Chained Equations) is a technique used to handle missing data. It works by building a model using the other variables to predict the missing values. This process is repeated multiple times, and the results are combined to give a final estimate that accounts for uncertainty. This approach is helpful because it provides a more accurate and robust way of handling missing data compared to just filling in one value. Citation: DataScience+ (2015). Imputing Missing Data with R MICE Package. Retrieved from https://datascienceplus.com/imputing-missing-data-with-r-mice-package/

3.5.4 Putting it alllllll back together…

We are almost done with imputation. The last thing we need to do is extract the imputed values from imputed_data$ximp, and add the State and target variable back. Then we get to wash, rinse, and repeat for the test set! Oh my!

3.5.4.1 Put the original State and target variables back
imputedTrain <- imputedTrain$ximp %>% 
  ## Put the original State and target variables back on
  mutate(State = train$State,
         PredictedReadmissionRate = train$PredictedReadmissionRate) 
3.5.4.2 Now do the imputation for test
# Remove encoded categorical variables for imputation, plus the target and state (which have no missing)
data_sans_encoded = test %>% 
  select(-PredictedReadmissionRate,
         -State,
         -contains("ComparedToNational"))

data_cat = test %>% 
  select(contains("ComparedToNational")) %>% 
  mutate_all(as.factor)

# Stitch them back together:
temp <- cbind(data_sans_encoded, data_cat)

# Impute missing values using missForest
imputedTest <- missForest(temp, 
                          variablewise = TRUE,
                          verbose = FALSE)
3.5.4.3 Load this if it will not run for you
load(file = "FY2024_data_files/imputedTest.Rdata")
3.5.4.4 Put the original State and target variables back on test
imputedTest <- imputedTest$ximp %>% 
  ## Put the original State and target variables back on
  mutate(State = test$State,
         PredictedReadmissionRate = test$PredictedReadmissionRate) 
3.5.4.5 Drop the features we determined we should not retain due to unstable imputation from BOTH train and test
imputedTrain <- imputedTrain[, !(names(imputedTrain) %in% cols2drop), drop = FALSE]
imputedTest  <- imputedTest[, !(names(imputedTest) %in% cols2drop), drop = FALSE]

Question 17: [1 point]

Quickly explore how well the imputation did by choosing at least one numeric variable and one of the categorical variables. Did the distributions or frequencies change drastically? You may find the summary() and table() functions to be fastest here.

# Check numeric variable: Medicare spending per patient
summary(imputedTrain$`Medicare spending per patient`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4900  0.9500  0.9900  0.9955  1.0400  1.6000
# Check categorical variable: ComparedToNational_MRSA Bacteremia
table(imputedTrain$`ComparedToNational_MRSA Bacteremia`)
## 
##   -1    0    1 
##   41 2029  140

The distributions for both the numeric and categorical variables after imputation appear reasonable. For the numeric variable Medicare spending per patient, the range remains plausible (0.49 to 1.60) with a median close to 1, suggesting no drastic changes. For the categorical variable ComparedToNational_MRSA Bacteremia, most hospitals remain at the national average (0), with fewer hospitals classified as better (-1) or worse (1). Overall, the distributions did not change drastically, indicating that the imputation process was successful and did not introduce major distortions in the data.

3.6 Another Fork in the Road

It’s time, yet again, to make a choice. If you opted to load the fully encoded dataset and everything has been done on that, you will run the first chunk. But if you decided that you wanted to wait to encode the TIME HAS FINALLY COME!

3.6.1 The Left Fork - No Encoding, Just Converting Data Back After missForest

Ultimately, we need to turn the factor features back into numbers for our next analyses but we had to turn them into factors for missForest. So, now we need to turn them back. Notice a bit of annoying trickery here. Because we had converted them to a factor type, in order to back-convert them correctly to our original ordinal encoding, we had to first convert to character and then to numeric. So, our conversion goes factor \(\rightarrow\) character \(\rightarrow\) numeric in order to get exactly what we need! Again, we must remember to perform on BOTH imputedTrain and imputedTest.

## Uncomment if this is your choice!
readyTrain <- imputedTrain %>%
mutate_if(is.factor, as.character) %>% 
mutate_if(is.character, as.numeric)
readyTest <- imputedTest %>%
mutate_if(is.factor, as.character) %>% 
mutate_if(is.character, as.numeric)

3.6.2 The Right Fork - Frequency Encoding

The time has FINALLY come to do frequency encoding, if that is the fork in the road you took all the way back in Section 1.3. Once again, we will be using a helper function that we load with the source() function like we did here. It will do the frequency encoding for us on the State variable, if it needs to be done. If you started from the fully encoded dataset, there is a conditional statement that will skip this for you.

## This calls the helper function in the associated .R file
source(file = "doFrequencyEncoding.R")

## If State is not already a number
if(!is.numeric(imputedTrain$State)) {
    ## This sets a list of COLUMN NAMES I want to frequency encode
    cols2encode <- c("State")
  
    ## This runs the function, which takes up to 4 arguments
    ## Open the source code to see what each argument does!
    ## It returns a LIST of frequency encoded dataframes.
    freqEncoded <- doFrequencyEncoding(train = imputedTrain,
                                       test = imputedTest,
                                       cols2encode = cols2encode, 
                                       quiet = FALSE)
    
    ## Lastly, extract the newly encoded, imputed, dataframes!
    readyTrain <- freqEncoded$train
    readyTest <- freqEncoded$test
}

Question 18: [1 point]

Because we are now frequency encoding AFTER data partitioning (we did ordinal encoding BEFORE the split for convenience), we absolutely must apply the same encoding map to the test set that we did to the train set. Explain why, and what should we do if some classes of the variable end up in the train that is not in the test or vice versa? (HINT: Look at the source code!)

It’s important to apply the same encoding map to the test set as we did for the train set to maintain consistency between the two datasets. If we create the encoding map using only the train data, we avoid leaking information from the test data into the model, ensuring fair evaluation of the model’s performance. If some categories are present in the train but not the test (or vice versa), we should map any unseen categories in the test to a default value like 0 or NA, rather than introducing new encodings or re-fitting the map. This keeps the model aligned with the training data’s structure and avoids data leakage or inconsistencies.

3.7 Transformation & Scaling

Ultimately, we know we will need to scale & center our data, as that is required for the analyses I have set out in our Analysis Plan. But before we get into that, it would be nice to know what realm of analyses in which we fall and, more importantly, do we need to consider transformation of our target in addition to the scaling & centering we plan to do on the entire dataset.

In other words: is our target variable approximately normally distributed?

Figure 7. Histogram of Raw Predicted Readmission Rate

Figure 7. Histogram of Raw Predicted Readmission Rate

Question 19A: [1 point]

Why is a histogram, or even a Q-Q plot, an insufficient way to assess whether a distribution is approximately normal?

A histogram or a Q-Q plot can give a visual sense of the distribution’s shape, but they aren’t precise or formal tests for normality. Histograms can be affected by bin width choices, and Q-Q plots can look normal even when there are subtle deviations from normality that could impact analysis. Also, they don’t provide a quantitative measure of how well the data fit a normal distribution. Formal tests like the Shapiro-Wilk test are more reliable for detecting non-normality because they provide statistical evidence, not just a visual impression.

Now Apply a Shapiro-Wilk test for normality using the shapiro.test() function. What does it indicate about your distribution? (Note: Shapiro tests are only reliable for \(N < 5000\), but it is fine to perform here.)

shapiro.test(imputedTrain$PredictedReadmissionRate)
## 
##  Shapiro-Wilk normality test
## 
## data:  imputedTrain$PredictedReadmissionRate
## W = 0.98544, p-value = 3.085e-14

W = 0.98544, p-value = 0.00000000000003085

3.7.1 Box-Cox Transformation

The Box-Cox transformation is a family of power transformations used to stabilize variance and make data more normally distributed (Box & Cox, 1964). Recall from our brief discussion in lecture that the Box-Cox transformation involves the parameter \(\lambda\), which when applied to \(y\) yields the transformation. It is defined as

\(y^\lambda = \frac{y^\lambda - 1}{\lambda}\) for \(\lambda \ne 0\).

When \(\lambda = 0\), it is the same as the \(\ln(y)\). Our lecture slides have more reference values for how to interpret \(\lambda\). But if \(\lambda \ne 0\), then this means that the transformation is only applicable to positive values. The end result is that it helps improve the performance of statistical models that assume normality.

3.7.1.1 There are multiple ways to apply the Box-Cox transform in R, but we are going to take advantage of the one in caret to make our lives easier, BoxCoxTrans().

bc_data <- BoxCoxTrans(readyTrain$PredictedReadmissionRate)
print(paste0("Box-Cox lambda =  ", bc_data$lambda))
## [1] "Box-Cox lambda =  -0.3"

We can see that the estimated \(\lambda = -0.3\), so approximately a square-root transform (see lecture slides for reference). I found that \(\lambda = -0.5\) in FY 2025, for those using that year’s data.

3.7.1.2 Now, apply the Box-Cox transformation to the train data:

readyTrain$bc_PredictedReadmissionRate <- predict(bc_data, readyTrain$PredictedReadmissionRate)

3.7.1.3 Assess how it did:

Figures 8a-b. Predicted Readmission Rate After Box-Cox Transformation

Can you see what it did to the distribution? look closely, if not. It’s subtle but discernible!!

Question 19B: [1.5 points]

Now repeat all of the steps to perform a Box-Cox transform on your own to the imputedTest data. Why must we (1) also transform the test data and (2) use the same \(\lambda\) as the training data?

We need to transform the test data so that it matches the format and scaling of the training data, ensuring the model sees the same type of data when making predictions. We must use the same lambda because it defines the specific shape of the Box-Cox transformation, and applying a different lambda to the test data would create inconsistencies, leading to unreliable model results.

# Apply the Box-Cox transformation to imputedTest using the lambda from training data
imputedTest$bc_PredictedReadmissionRate <- predict(bc_data, imputedTest$PredictedReadmissionRate)

# Histogram
ggplot(imputedTest, aes(x = bc_PredictedReadmissionRate)) +
  geom_histogram(alpha = 0.6, 
                 fill = "blue",
                 color = "blue") +
  theme_minimal() +
  labs(title = "Box-Cox Transformed Predicted Readmission Ratio - Test Data",
       y = "Frequency",
       x = "Predicted Readmission Ratio (Box-Cox Transformed)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Q-Q Plot
qqnorm(imputedTest$bc_PredictedReadmissionRate)
qqline(imputedTest$bc_PredictedReadmissionRate, col = "blue", lty = 2, lwd = 2)

NOTE At this point, you have both an non-transformed (PredictedReadmissionRate) and Box-Cox transformed (bc_PredictedReadmissionRate) version of the target in your data.

3.7.2 Centering & Scaling

Now that we have applied a Box-Cox transform to the target variable, we can center and scale the remainder of the variables. We had to perform the Box-Cox prior to centering because the data has to be positive for a Box-Cox transformation (unless we switch to a Yeo-Johnson transform (Yeo & Johnson, 2000)). Centering will likely make a lot of the data negative, plus we know we have negatives in our dataset purposefully.

It’s time to drop the original target, PredictedReadmissionRate, and center and scale everything INCLUDING the target! Here it is for the training set:

readyTrain <- readyTrain %>% 
  ## Drop the original target variable
  select(-PredictedReadmissionRate) %>% 
  ## Make anything that might still be a factor from encoding is back to number
  mutate_if(is.factor, as.numeric) %>%
  ## Center and scale everything!
  scale(center = TRUE, scale = TRUE) %>% 
  data.frame()

Question 19C: [0.5 points]

Do the same for your test data. Make sure to use readyTest and overwrite it.

readyTest <- readyTest %>%
  select(-PredictedReadmissionRate) %>%  # Drop the original target
  mutate_if(is.factor, as.numeric) %>%   # Convert any remaining factors to numeric
  scale(center = TRUE, scale = TRUE) %>% # Center and scale the data
  data.frame()                           # Convert to dataframe

Hint: Make sure to turn your centered, scaled matrix back into a dataframe! By default, it creates a matrix.

3.7.3 To save time, you are welcome to start from here after you’ve worked through the above steps.

load(file = "FY2024_data_files/readyTrain.Rdata")

load(file = "FY2024_data_files/readyTest.Rdata")

4 Assessing Multicollinearity & Feature Redundancy

As we finally, finally wrap-up pre-processing there is one final assessment we need to perform: multicollinearity. As I am certain you recall, multicollinearity is when we have high correlation between predictors (so much so, in fact, that it comes at the cost of any correlation with the target!). Recall from lecture that multicollinearity is when one or more of our predictors are moderately to highly correlated with each other. This can inflate our coefficients and thereby cause spurious associations through false positives (i.e., small p-values when we don’t actually have them!).

Here we will implement two common ways we assess multicollinearity: pairwise correlations and Variance Inflation Factors (VIFs).

4.1 Pairwise Correlation

Pairwise correlation is usually done as a Pearson or Spearman correlation. We will calculate a correlation matrix and look at the correlation of the variables within the training dataset to assess possible areas of multicollinearity. We can then visualize this matrix with a heatmap, similar to how we did with missingness, except now blue indicates a positive correlation, red indicates a negative correlation, and yellow indicates no (or zero) correlation. Darker colors indicate stronger correlations.

Figure 9. Correlation Heatmap of All Variables

Figure 9. Correlation Heatmap of All Variables

Question 20A: [1 point]

Where do you see the highest potential for multicollinearity? Why do you say that?

The highest potential for multicollinearity appears in the cluster of variables related to hospital experience metrics (e.g., Nurse communication, Doctor communication, Overall hospital rating, Recommend hospital, Care transition, etc.). These variables are clustered together and show high correlations (bright red squares), suggesting they are closely related and may provide overlapping information.

Which variable(s) seem to be most correlated with our target, bc_PredictedReadmissionRate?

The variable most correlated with bc_PredictedReadmissionRate appears to be staff responsiveness and nurse communication.

4.2 Fitting a model to estimate VIF

Another way we can assess multicollinearity is by fitting a multiple linear regression model and estimating the variance inflation factors, or VIF, which estimates how much the variance of an estimated regression coefficient increases if your predictors are correlated. Specifically, variance inflation can identify multicollinearity when the VIF is greater than ~4 and especially when it is above 10.

4.2.1 Fit a linear regression with the training data, using the Box-Cox transformed Predicted Readmission Rate as the target.

mod <- lm(bc_PredictedReadmissionRate ~ ., data = readyTrain)

4.2.1 Next, let’s calculate the VIFs using the vif() function that is part of the car package.

Note that I am not printing all of the VIFs, just the top 15 after sorting in descending order.

## Calculate the VIFs and put into a dataframe to print the table
vifResults <- vif(mod) %>% data.frame
## Change the column names of the dataframe
colnames(vifResults) <- c("VIF")
Table 7. Top 15 Variance Inflation Factors after multiple linear regression
VIF
Overall.hospital.rating 22.69
Recommend.hospital 14.77
Nurse.communication 8.35
Care.transition 8.02
Staff.responsiveness 5.26
Communication.about.medicines 4.59
Doctor.communication 4.19
Discharge.information 3.48
Composite.patient.safety 2.74
Quietness 2.35
ComparedToNational_Composite.patient.safety 1.87
Death.rate.for.pneumonia.patients 1.85
Cleanliness 1.79
ComparedToNational_Death.rate.for.pneumonia.patients 1.76
Postoperative.respiratory.failure.rate 1.56

Question 20B: [1 point]

Does multicollinearity seem to be an issue in this dataset and, if so, among which variables? Why do you come to this conclusion?

HINT 1: What was used to calculate Overall Hospital Rating? Does this help us better understand why the VIF is so large?

HINT 2: Is, for example,Payment Category for pneumonia patients created from Payment for pneumonia patients? [These are Medicare payments]? Which one do you think we should keep amd why?

Yes, multicollinearity is an issue in this dataset, especially for variables like Overall hospital rating, Recommend hospital, and Nurse communication, which have high VIF values. This makes sense because Overall hospital rating is likely calculated using other patient experience variables, like communication and staff responsiveness, causing overlap. For the payment variables, Payment Category for pneumonia patients is derived from Payment for pneumonia patients, so it would make sense to keep the continuous Payment for pneumonia patients for better granularity in modeling.

Which variable(s) seem to be most highly correlated with our target? Is it a positive or negative correlation? Does it make sense?

Based on the correlation matrix you previously shared, Nurse communication and Staff responsiveness appear to have the strongest correlations with the target variable bc_PredictedReadmissionRate. The correlations are negative, meaning that better communication and responsiveness are associated with lower readmission rates, which makes sense: when patients feel heard and cared for, they are less likely to return to the hospital.

4.3 Dropping the most collinear variables

Even though we plan to proceed with an elastic net (which is much more robust to multicollinearity), some of these VIFs are too high for comfort. We also want to pause and ask - were any of the variables used to make other variables? (The answer is “yes”!). Those are excellent candidates to scrub too; we want to use our domain knowledge and what our stakeholders are asking for to help us decide which ones to keep.

Because the goal is a deployable predictive model, we likely want something more granular than most of the ComparedToNational features can give us, with the exception of our key predictor, ComparedToNational_Hospital return days for pneumonia patients. I will remind you of what we explored in Section 2 with this variable; not only that, we can see how how highly correlated it is with our target!.

4.3.1 Drop variables with VIF over 5 from both the training and testing data

## Drop those with VIF over 5
vifOver5 <- vifResults %>% filter(VIF >=5) %>% rownames()
print(vifOver5)
## [1] "Nurse.communication"     "Staff.responsiveness"   
## [3] "Care.transition"         "Overall.hospital.rating"
## [5] "Recommend.hospital"
## Now drop them
readyTrain <- readyTrain %>% select(-any_of(vifOver5))
readyTest <- readyTest %>% select(-any_of(vifOver5))

4.3.2 Now drop all of the ComparedToNational and PaymentCategory features except ComparedToNational_Hospital return days for pneumonia patients

readyTrain <- readyTrain %>% 
  ## Rename the column we want to keep
  rename(Hospital.return.days.for.pneumonia.patients = ComparedToNational_Hospital.return.days.for.pneumonia.patients) %>% 
  ## Drop all the other columns 
  select(-contains("Compared"), -contains("PaymentCategory"), -SurveyResponseRate)

readyTest <- readyTest %>% 
  ## Rename the column we want to keep
  rename(Hospital.return.days.for.pneumonia.patients = ComparedToNational_Hospital.return.days.for.pneumonia.patients) %>% 
  ## Drop all the other columns 
  select(-contains("Compared"))

5 Unsupervised Learning Methods: Segmentation Analysis

Predictive modeling is our ultimate goal, but let’s pretend our stakeholders have expressed a desire to better understand how the hospitals relate to one another and whether there are any general patterns we should be interested in for enhanced interpretation. After all, a prediction is one thing, but understanding the why is still usually important!

Thus, we have opted to undertake a Segmentation Analysis to see if there is any underlying segments, or clusters, of hospitals based on the data we have. We will use the unsupervised methods of Principal Component Analysis (PCA) and \(k\)-means clustering do this; it will help us to determine if there are broader classifications of our hospitals that we need to report back to the advocacy group.

5.1 Principal Component Analysis

First, we apply a PCA on the data and display the principal components. Note that we have the same number of principal components as we do variables in the data set! The cumulative proportion of variance adds to 100% by our last principal component; but we’re really more interested in the portion of variance explained by EACH of the principal components. This can help us figure out just how many PCs are important for explaining the most variance. For example, if we were interested in feature reduction, we could use the top PCs that explain, say, some portion of the variance that we wish to retain.

Question 21: [2 points]

  1. At what principal component do we surpass 80% of explained variance? (HINT: look at the cumulative variance row rather than tha proportion of variance row).

We surpass 80% of explain variance at PC11 - principal component 11. We see that up through PC10, the cumaltive explained variance is just below 80%. Once we are at PC11, the cumultive proportion jumps to ~83.98%, meaning PC11 is the first component to push us past the 80% threshold. This indicates to us that the first 11 principal components together explain more than 80% of the total variance in the dataset.

pca <- prcomp(readyTrain)
summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0083 1.3531 1.23770 1.15277 1.11902 0.98837 0.95617
## Proportion of Variance 0.2241 0.1017 0.08511 0.07383 0.06957 0.05427 0.05079
## Cumulative Proportion  0.2241 0.3258 0.41089 0.48472 0.55429 0.60856 0.65935
##                            PC8     PC9   PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.93867 0.92718 0.8807 0.85563 0.81030 0.73429 0.68600
## Proportion of Variance 0.04895 0.04776 0.0431 0.04067 0.03648 0.02995 0.02614
## Cumulative Proportion  0.70830 0.75606 0.7992 0.83983 0.87631 0.90626 0.93240
##                           PC15    PC16    PC17    PC18
## Standard deviation     0.65207 0.58828 0.50527 0.43607
## Proportion of Variance 0.02362 0.01923 0.01418 0.01056
## Cumulative Proportion  0.95603 0.97525 0.98944 1.00000
  1. You may also notice that this is always ordered from most proportion of variance on the left (thus is PC1 defined) to our final principal component on the right. What percent of variance is explained by just PC1? What percent by the final PC? Is the attributable portion of the variance in PC1 high, medium, or low based on some quick research or prior knowledge?

The percent of variance explained by just PC1 is 22.41% and the percent of variance explained by the final principal component (PC18) is 1.06%. The attributable portion of variance in PC1 — over 20% — is considered high as with PCA, it’s typical for PC1 to capture the largest share of variance. This is because it finds the single axis that explains the most variability in the data. A value over 20% is considered substantial because PC1 alone explains a large portion of the overall variance, while each of the later components, individually contributes much smaller percentages — often less than 10%.

Let me know if you want this phrased even more formally for a report!

  1. We use scree plots to graphically determine how many principal components to retain and how many principal components best explain the variance of the data. The \(x\)-axis displays the principal components (PC1, PC2, etc.), while the \(y\)-axis shows the eigenvalues (or proportion of variance explained) for each component. The elbow method we discussed in class is the point where the slope starts to flatten or the point at which the percentage of variance explained seems to level off. If we were doing this for feature reduction, we might think of this as approximately how many principal components would be needed to explain the total variation in the training data. However, it is a heuristic method, so it can sometimes be a little uncertain from the plot alone what the optimal number of PCs is. Let’s take a look at the top-15 PCs on a scree plot now:
Figure 10. Scree Plot of First 15 Principal Components

Figure 10. Scree Plot of First 15 Principal Components

How many PCs appear optimal to you based on the scree plot?

Based on the scree plot, the optimal number of principal components appears to be around 4 to 5 PCs. This is where the elbow or bend in the plot occurs (after PC4 or PC5, the slope flattens) Adding more components past 5, explains only small additional amounts of variance. As a result, retaining these top components would capture most of the meaningful variation in the data without adding too much noise or complexity.

  1. People often mis-write scree (I’ve even accidentally done it) as SCREE, as if it’s an acronym like AUROC or ROC or even PCA. But it’s not. Why is it called a scree plot? (Hint: It may not be what you expect!)

It’s called a scree plot and not ‘SCREE’ because the pattern of the graph resembles a scree slope which is the pile of loose rocks or debris that collects at the base of a cliff or mountain. With its roots from geology, a scree slope shows a steep drop followed by a long ‘tail off’, just like how the explained variance sharply drops after the first few principal components and then levels out (as seen in the chart above). The name pulls from this visual resemblance and not an acronym which would be inferred if it was otherwise capitlizaed.

5.1.2 Variable loadings and contributions

Next, let’s explore how much the variables are contributing (this is based on the weight of their rotations). In PCA, we are interested in the loadings, i.e., the linear combination weights (coefficients) whereby unit-scaled components define or “load” a variable. Loadings help us interpret principal components.

You may have noticed we used the prcomp() function above; prcomp() will return the loadings in the variable $rotation, which contains a matrix of variable loadings. These loadings have been determined from the eigenvectors, which without getting into what that is (because we’d have to take a departure into linear algebra), suffice it to say that they are how we are measuring the direction and magnitude during each subsequent rotation of the principal components as we measure the variance explained.

Additionally, we are going to color this by our grouping variable and key predictor in the dataset, Hospital.return.days.for.pneumonia.patients. Recall that when we explored this earlier this variable is the performance of each hospital in terms of its pneumonia-related return-to-hospital days relative to the national average. In other words, hospitals that were “Worse than average” had readmitted pneumonia patients with longer-than-average stays.

Figure 11. PCA Biplot Showing the Top Feature that Contributes to Explained Variance

Figure 11. PCA Biplot Showing the Top Feature that Contributes to Explained Variance

This biplot (Figure above) shows that, on principal components 1 and 2 (the PCs that account for the most variance in the training data) there is subtle but noticeable grouping of hospitals in terms of how long their readmitted pneumonia patients stayed in the hospital. I have chosen to display the #1 feature that contributes to that variance, Communication.about.medicines.

Each quadrant on a PCA biplot corresponds to a different directional influence of the original variables. E.g., because it is in the top-right quadrant I where both PC1 and PC2 are positive, this suggests that communication about medicines are positively associated with variables that point into the same quadrant. Since it seems to correspond with where the “better” hospitals are, this suggests that these hospitals are made somewhat distinctly “better” at readmitted hospital stay lengths because of correlations with communication about medicines. (NOTE: If you look at the 2025 data, the bipolot is effectively inverted but the conclusions seem to be the same in 2025 as well).

A general rule of thumb is:

  1. Top-right Quadrant I
  • PC1 and PC2 are both positive
  • Thus points in this quadrant are positively associated with variables pointing into this quadrant
  • Indicates high values for variables with arrows in this quadrant
  1. Top-left Quadrant II
  • PC1 is negative, PC2 is positive
  • Thus points here are associated with high values of variables pointing to the left and top
  • These observations contrast with those in the right-side quadrants on PC1
  1. Bottom-left Quadrant III
  • PC1 and PC2 are both negative
  • Thus points here are negatively associated with variables pointing into the other quadrants
  • These points could represent a distinct subgroup or pattern with behavior opposite to other points.
  1. Bottom-right Quadrant IV
  • PC1 is positive, PC2 is negative
  • Thus points here are associated with variables pointing into this quadrant
  • May indicate a contrast with Quadrant II on PC2

Question 22: [1 point]

Play around with the graphical parameters in the biplot, especially those I’ve marked with comments. Try adding, for example, more variables to display rather than just the top contributor. NOTE that if you find variables in the same quadrant together, this means that they are correlated. What variable(s) can you find that contrast with Communication.about.medicines?

The variable ‘Composite.patient.safety’ contrasts with ‘Communication.about.medicines’ because it points in nearly the opposite direction (near Quadrant III), whereas Communication.about.medicines points more toward Quadrant I, in the top right. This suggests that as hospitals score higher on communication about medicines (where there is a positive association), they tend to score lower on composite patient safety, indicating these two variables may capture different or even opposing aspects of hospital performance.

code added here to display the top 5 contributing variables in the biplot.

fviz_pca_biplot(pca, palette = c(“maroon”, “cadetblue”, “gold”), label = “var”, geom = “point”, geom.var = “text”, addEllipses = TRUE, ellipse.alpha = 0.5, col = “black”, col.var = “black”, habillage = returnHospitalComparison, select.var = list(contrib = 5), ## <- updated to show top 5 contributors repel = TRUE)

Now, look at the contributions to PCs 1 and 2 (Figure Below) from the variables in the training dataset. Anything above the dotted red-line contributes significantly to the overall explanation of variance, and anything below the dotted line does not. Does anything about this plot stand out to you?

HINT 1: What do the top contributing variables all have in common? In other words, did they all come from the same data source?

HINT 2: Where does the target, PredictedReasmissionRate, fall among the significant variables?

The top contriubators to the overall explaination of variance are: Communication about medicines, Doctor communication, Composite patient safety, Discharge information and Quietness as these are all above the dotted line. Notably, these variables are all related to patient experience and safety measures. These largely come from the patient satisfaction and safety survey data source. We can infer that patient communication, responsiveness, and safety practices are the strongest drivers of the variance in PC1 and PC2 as they dominate over their operational or administrative metric counterparts. The target variable, PredictedReadmissionRate, is just below the red significance line, meaning that while it contributes somewhat to the variation in PC1 and PC2, it’s not among the strongest drivers. This tells us that even though readmission is the outcome of interest, the factors explaining most of the variance in the dataset are not directly the readmission rate itself, rather patient experience and safety factors.

Figure 12. Variables that Significantly Explain Variance.

Figure 12. Variables that Significantly Explain Variance.

5.2 k-means Clustering

\(k\)-means is a centroid-based clustering algorithm, where a “centroid” is the geometric center of an object often measured by Euclidean distance. In the algorithm, the distance between each data point and a centroid is calculated by randomly grabbing data; these distances are then used to assign the data point to a cluster. The goal is to identify the optimal \(k\) number of groups in a dataset by minimizing the distance of each datapoint to a respective centroid.

5.2.1 Visualizing Clusters

Although multiple methods exist to determine the optimal number of clusters, one commonly employed is heuristic, i.e., it isn’t really computational. You will either choose the optimal clusters based on a set of options of \(k\) graphically or you can once again employ the “elbow” method again, as you did with PCA. Let’s start by visually inspecting the effect of different \(k\) clusters on PC1 and PC2 of the dataset. I have chosen to test \(k\) groups of 2 through 9.

Figure 13. Biplots showing possible k clusters

Question 23A: [0.5 points]

Which number of clusters, \(k\), do you think has the best explanatory power? Why? Is it is hard to tell?

It is hard to tell which number of clusters ‘k’ has the best explanatory power. Specifically, as k increases beyond 5, the clusters look noisy and the visualization is cluttered. From what I can interept, k=4 or k=5 may have the best explanatory power as they capture finger groupings but don’t have as much overlap. Implementing the elbow method can help to narrow down which ‘k’ is in fact the most optimal.

5.2.3 Within-sum-of-squares (WSS) elbow method

Next, let’s employ the ‘elbow’ method again, this time to look at when the within sum of squares drops off rather than the percent variance, as was done in PCA.

Figure 14. WSS Scree plot for k-means

Figure 14. WSS Scree plot for k-means

Question 23B: [1 point]

What is the within sum of squares and what exactly is it measuring here? By the elbow method, which \(k\) is the optimal number of clusters? Did it agree with your choice from the other heuristic method?

The within sum of squares is a measure of how tightly the data points within each cluster are grouped around their cluster’s centroid. It sums up the squared distances between each point and its assigned cluster center across all the clusters. In this case, it is measuring the compactness of clusters and the elbow method gives us clarity to which point adding more clusters only yields minimial improvements. Looking at the WSS elbow plot, the optimal number of clusters is k=2, since that’s where we see the first major elbow or bend — after this point, adding more clusters leads to only minor improvements in WSS. This does not agree with my earlier choice but this one is more accurate and precise.

5.2.3 Silhouette method

Lastly, we will apply the silhouette method, which measures the average silhouette width, a combination of (1) how similar any given point is to its own cluster (also called cohesion) and (2) how different that same point is from other clusters (also called separation). Silhouette scores can range from \(-1 \rightarrow +1\), with larger values indicating better clustering (i.e., more cohesive clusters that are separated from each other). Thus, a silhouette score of 1 indicates perfect separation and cohesion whereas a silhouette of 0 indicates overlapping clusters.

What we can do is iteratively test our possible \(k\) values of 2 to 9 and a silhouette score will be calculated for each \(k\) tested. The highest silhouette score indicates the optimal \(k\).

Figure 14. Silhouette Plot for k-means

Figure 14. Silhouette Plot for k-means

Question 23C: [0.5 points]

Are you surprised at the optimal \(k\) from the silhouette method? Does it seem to agree with what you’ve seen with the WSS elbow method? (Hint: agreement generally suggests evidence that you have found the optimal \(k\)).

No, I am not surprised that the silhouette method suggests k=2 as the optimal number of clusters. This matches the result from the WSS elbow method, where the most noticeable “elbow” or bend was also at k=2. When both methods agree, it strengthens the evidence that we have found the best clustering solution, giving us more confidence that k=2 is in fact meaningful.

Lastly, let’s extract the optimal \(k\) from the silhouette method.

## Extract optimal k from silhouette method; first get the cluster data
clusterData <- fviz_nbclust(readyTrain, kmeans, method = "silhouette")$data
## Then extract the maximum y
k <- as.numeric(clusterData$clusters[which.max(clusterData$y)])

5.3 Segmentation Analysis

As we said before, the goal of segmentation analysis is to broadly cluster our hospitals by their features to help us characterize them.

Question 24: [1 point]

Let’s explore the clusters - segments - of the hospitals based on their average values from the original dataset (we start from imputedTrain, which hadn’t yet been transformed, centered, or scaled, BUT the multicollinear variables also haven’t been dropped either). Your task is to add comments to this code chunk.

### Run k-means clustering using the chosen optimal number of clusters (k)
kmeansResult <- kmeans(readyTrain,  
                         centers = k,      #set number of k clusters to optimal k value 
                         nstart = 50,      
                         iter.max = 1000,  
                         algorithm = "MacQueen")
#Prepare the original imputed dataset for analysis and visualization
kmeansTrain <- imputedTrain %>% 
  ## Rename the column we want to keep
  rename(Pneumonia.Hospital.Return.Days = `ComparedToNational_Hospital return days for pneumonia patients`) %>% 
  #Drop variables where VIF>5, indicating high multicollinearity and irrlevent categorical codes 
  select(-any_of(vifOver5), -contains("Compared"), -contains("PaymentCategory"), -SurveyResponseRate) %>% 
  #add the cluster assignments from k-means
  #recode pneumonia variable into numeric form 
  mutate(Cluster = kmeansResult$cluster,
         Pneumonia.Hospital.Return.Days = as.numeric(if_else(Pneumonia.Hospital.Return.Days == "1", 1,
                                                     if_else(Pneumonia.Hospital.Return.Days == "0", 0, -1)))) %>% 
  #Clean up column names to make them analysis friendly 
  clean_names()

## Using % contribution graph, the order of the top 10:
pcaImportantVars <- factor(c("Communication.about.medicines",
                       "Doctor.communication", 
                       "Composite.patient.safety",
                       "Discharge.information",
                       "Quietness",
                       "Postoperative.respiratory.failure.rate",
                       "Cleanliness",
                       "Periop.Rate.Of.Embolism.Or.Thrombosis",
                       "Predicted.Readmission.Rate",
                       "Pneumonia.Hospital.Return.Days"))

#Convert variable names to lowercase 
pcaImportantVars <- tolower(pcaImportantVars)
#Replace (.) with (_) in variable names 
pcaImportantVars <- gsub("\\.", "_", pcaImportantVars) %>% 
#Convert to consistent title case 
  to_any_case("title")

#Create a boxplot where we compare key variable distributions across clusters 
kmeansTrain %>% 
  #Rename selected variables to align with plot labels 
  rename(Predicted.Readmission.Rate = predicted_readmission_rate,
         Periop.Rate.Of.Embolism.Or.Thrombosis = perioperative_pulmonary_embolism_or_deep_vein_thrombosis_rate) %>% 
  #Ensure column names follow consistent title casing 
  clean_names(case = "title") %>% 
  #Select only cluster and important variables for plotting 
  select(Cluster, any_of(pcaImportantVars)) %>% 
  #Reshape data into long format 
  pivot_longer(-1, names_to = "Measure", values_to = "Value") %>% 
  #Reorder factor levels to match original PCA variable important order 
  filter(Measure %in% pcaImportantVars) %>% 
   mutate(Measure = factor(Measure, levels = to_any_case(c("Communication.about.medicines",
                        "Doctor.communication", 
                        "Composite.patient.safety",
                        "Discharge.information",
                        "Quietness",
                        "Postoperative.respiratory.failure.rate",
                        "Cleanliness",
                        "Periop.Rate.Of.Embolism.Or.Thrombosis",
                        "Predicted.Readmission.Rate",
                        "Pneumonia.Hospital.Return.Days"), case="title"))) %>% 
  #Plot the boxplots with ggplot, color by cluster 
    ggplot(aes(x = Cluster, y = Value, fill = as.factor(Cluster))) +
    geom_boxplot(alpha=0.75) +
    scale_fill_manual(values = c(skittles[1], skittles[6])) +
    labs(title = "Cluster Means",
         x = "",
         fill = "Cluster") +
    theme_classic() +
    theme(legend.position = "bottom",
        strip.text = element_text(size=8),
        legend.text = element_text(size = 8),
        legend.title = element_text(size = 10),
        title = element_text(size = 12, color = "maroon")) + 
    #Create seperate facet panels for each variable 
    facet_wrap(~ Measure, 
             scales = "free_y",
             ncol = 2)
Figure 16. Boxplot Comparisons of Important Variables by Cluster

Figure 16. Boxplot Comparisons of Important Variables by Cluster

#Summarize the key variables by cluster group in a summary table
kmeansTrain %>% 
  #Rename the columns to be consistent with summary table labels 
  rename(Predicted.Readmission.Rate = predicted_readmission_rate,
         Periop.Rate.Of.Embolism.Or.Thrombosis = perioperative_pulmonary_embolism_or_deep_vein_thrombosis_rate) %>% 
  #Clean column names to matxh summary formatting 
  clean_names(case = "title") %>% 
  #Select cluster and key variables 
  select(Cluster, any_of(pcaImportantVars)) %>% 
  #Create a summary table comparing mean and median across clusters
  tbl_summary(by = "Cluster",
              statistic = list(all_continuous() ~ c("{mean} ({median})"))) %>% 
  modify_header(label ~ "Variable",
                all_stat_cols() ~ "**Cluster {level}**") %>% 
  modify_caption("Table 8. Summary Statistics of Important Cluster Variables")
Table 8. Summary Statistics of Important Cluster Variables
Variable Cluster 11 Cluster 21
Communication About Medicines 76.5 (76.0) 70.9 (71.0)
Doctor Communication 91.03 (91.00) 87.68 (88.00)
Composite Patient Safety 0.98 (0.96) 1.06 (1.01)
Discharge Information 87.0 (87.0) 82.3 (83.0)
Quietness 83.2 (83.4) 77.1 (77.0)
Postoperative Respiratory Failure Rate 8.69 (8.53) 9.93 (9.25)
Cleanliness 86.4 (86.8) 82.1 (82.6)
Periop Rate of Embolism or Thrombosis 3.55 (3.52) 3.73 (3.59)
Predicted Readmission Rate 15.77 (15.67) 17.57 (17.38)
Pneumonia Hospital Return Days

    -1 188 (14%) 430 (50%)
    0 976 (72%) 408 (47%)
    1 185 (14%) 23 (2.7%)
1 Mean (Median); n (%)

HINT: If these are back to the original values, this is more interpretable for our stakeholder. We will now read this just like a typical boxplot; for example, median (the middle line) Doctor Communication seems higher in Cluster 1 than it does in Cluster 2 with only some overlap in the distributions (the boxes and whiskers don’t really overlap each other). The table confirms this; the median Doctor Communication score is 91% in Cluster 1 but only 88% in Cluster 2.

Question 25: [2 points]

Notice, for example, that Cluster 1 hospitals seem to include the hospitals that had the lowest pneumonia-related readmission rates, with the average rate ~2% higher in Cluster 2 hospitals. Thus, for our stakeholder, we might decide to rename Cluster 1 as “Highest Performing Hospitals” or something along those lines. Notice also that this cluster had more below national average return hospital days (a “1”). This means that they had more readmitted pneumonia patients who spent less time in the hospital upon readmission than the national average.

Look at how else you might broadly characterize - or segment - these hospitals. Your job is to make a table for your client, summarizing each of the two clusters with FOUR primary criteria. Make sure to give each cluster an informative name as I did; you’re welcome to rename Cluster 1 if you think of a catchier name than I chose!

After you come up with names of the two clusters, you will then come up with at least FOUR characteristics per cluster summarizing what you see in either the boxplots or table. You may choose to focus on different attributes for different clusters. Fill in the code for the table below. I start you out with two examples for Cluster 1, but feel free to add more!

Table 9. Hospital segmentation analysis: two types of broad groupings identified.
Top Defining Attributes Description
1: Highest Performing Hospitals
Predicted Pneumonia-related hospital readmissions These hospitals have a lower readmission rate, suggesting that hospitals in this cluster may be better at diagnosing and treating pneumonia
Return hospital days due to pneumonia Readmitted patients spend below the national average number of days in hospital, suggesting these hospitals are able to more quickly treat and stabilize pneumonia patients
Strong patient communication Higher scores on doctor communication, communication about medicines, and discharge information suggest these hospitals prioritize clear patient communication
Better patient experience Higher ratings for cleanliness and quietness suggest a stronger overall patient experience
2: Underperforming / At-Risk Hospitals
Predicted Pneumonia-related hospital readmissions These hospitals have a higher readmission rate, suggesting they may struggle more with pneumonia outcomes
Return hospital days due to pneumonia Readmitted patients spend at or above national average days in hospital, indicating potentially longer recovery needs
Weaker patient communication Lower scores on communication-related metrics suggest gaps in how information is conveyed to patients
Higher clinical complications Higher postoperative respiratory failure rates suggest more clinical challenges in this group

5.4 Final Remarks on Segmentation

We don’t technically have to stop our segmentation analysis here, but we will so that we can focus on other analyses. One thing we might choose to do in the future, for example, is to choose a classification machine learning algorithm to test the predictions based on these clusters.

Question 26: [1 point]

Make a recommendation for our client for 1-2 machine learning analyses your team would choose to use to test whether these clusters can be used for robust predictions of hospital performance for pneumonia patients. How would you know whether or not your predictions were robust? What would you look for or compare?

I would advise the client to uselogisitic regression and random forest as two means of machine learning analyses, in order to test whether these clusters can be used for robut predictions of hospital performance for pneumonia patients. Logistic regression tests whether cluster member 1 or 2 predicts a binary outcome, in this case, high vs. low pneuominia readmission rates. It also gives us coefficents and helps assess how strongly the clusters explain variation in our key outcome. Random Forest can be applied as it handles complex interactions and onlinear patterns - in this case it can provide variable importance scores so we can see if cluster membership is a top predictor.

6 Supervised Learning Methods

We are going to perform two supervised learning methods: an Ordinary Least Squares (OLS) regression followed by an Elastic Net.

6.1 Ordinary Least Squares (OLS) regression

Earlier in Section 4 when we assessed multicollinearity, we fit an OLS regression model so that we could calculate variance inflation factors (VIFs). However, because OLS regression is so incredibly sensitive to multicollinearity, we should re-run it now that we’ve removed some of the most redundant variables.

Let’s model that now, using the readyTrain dataset and plot some diagnostic plots. to assess the key assumptions of an OLS linear regression. The four plots are diagnostic plots for the primary assumptions of a linear regression:

  1. Linearity between the outcome and the predictors

  2. Normally distributed residuals

  3. Homosecadasticity (equality of variances) in the residuals

  4. No high leverage/influence points (no significant outliers)

Figure 17. OLS Regression Assumption Diagnostic Plots

Question 27A: [1 point]

What is your general assessment of our OLS assumptions? Met or unmet? Why?

Overall, the general assessment of our OLS assumptions are met. Since the residuals vs. fitted plot show no real patterns/curves, it suggests the relationship between predictors and outcome is primarily linear. The Q-Q plot shows the points following the diagnol line and the Shapiro-Wilk test (p=0.067) confirms no sigifnicant departure from normality. The scale-location plot shows a fairly random spread, suggesting constant varianc across fitted values and the Cook’s distance plot shows no extreme influential points. The key OLS assumptions in this case are sastified.

Note: If you are not entirely sure about the Q-Q plot, you can run a Shapiro-Wilk test on the residuals using the following code:

shapiro.test(residuals(mod))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod)
## W = 0.99862, p-value = 0.06735

The most important OLS assumption for Elastic Net is linearity of the relationship between predictors and the response. The other, stricter OLS assumptions don’t matter as much because of regularization.

Do you find support for linearity here? Why or why not?

I do think there is reasonable support for linearity here. Looking at the residuals vs. fitted plot, we see that the residuals are scattered randomly around the horizontal line at zero without obvious patterns or curves - this suggests that the relationship between the predictors and the outcome is approximately linear. Elastic net relies on the core assumption that there is a linear relationship between predictors and the outcome variable. We can conlude that since the diagnostic plot show no real curvature or non-linear patterns, the linearity assumption is sufficently met.

  1. Let’s now assess how well our model fits with the adjusted \(R^2\).
## [1] 0.39427

Question 27B: [1 point]

It is a surprisingly high (although it isn’t stellar). You may need to do some outside research, but explain in a sentence or two why, generally speaking, we use adjusted \(R^2\) and not the unadjusted one (confusingly labeled as multiple r-squared in the summary() output below).

We used the adjusted R^2 because it accounts for the number of predictors in the model and it penalizes for adding variable that do not improve the model’s explanatory power. The unadjusted R^2 will eithr increase or stay the same when more predictoes are added (even if they do not yield meaningful outcomes). Since R^2 increases only when new predictors improve the model, it is a more reliable measure of model fit, especially when comparing models with different numbers of predictors.

  1. Find the best model.

Use the step() function to do a backward regression to find our most parsimonious model. Stepwise regression drops terms that are not significant or important, thereby only keeping the features that contribute to the overall explanation of variance in predicted readmission rate.

## Perform a backward, stepwise regression to find the most parsimonious model
bestMod <- step(mod, 
            direction = "backward", 
            trace = FALSE)

Lastly, let’s take a peek at the results. We will use a package called stargazer to help us visualize the results in a nicely organized way.

## Print a table using stargazer
stargazer(bestMod, 
          type = "html",
          title = "Table 10. Parsimonious OLS Regression Results.")
Table 10. Parsimonious OLS Regression Results.
Dependent variable:
bc_PredictedReadmissionRate
MRSA.Bacteremia 0.039**
(0.017)
Abdomen.CT.Use.of.Contrast.Material 0.045***
(0.017)
Death.rate.for.pneumonia.patients -0.154***
(0.017)
Medicare.spending.per.patient 0.167***
(0.018)
Healthcare.workers.given.influenza.vaccination 0.044**
(0.018)
Left.before.being.seen 0.062***
(0.017)
Venous.Thromboembolism.Prophylaxis 0.101***
(0.018)
Doctor.communication -0.074***
(0.027)
Discharge.information -0.053**
(0.024)
Cleanliness -0.053***
(0.020)
Quietness -0.077***
(0.022)
Hospital.return.days.for.pneumonia.patients -0.396***
(0.018)
State 0.110***
(0.019)
Constant 0.000
(0.017)
Observations 2,210
R2 0.398
Adjusted R2 0.395
Residual Std. Error 0.778 (df = 2196)
F Statistic 111.824*** (df = 13; 2196)
Note: p<0.1; p<0.05; p<0.01

Lastly, let’s assess how good a predictive model our OLS regression model is. We will use the predict() function to first fit our most parsimonious model bestMod to the test data:

## Make the predictions
predictions <- predict(bestMod, newdata = readyTest)

The Root Square Mean Error (\(RMSE\)) is the square root of the average of the squared differences between predicted and actual value, such that \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n\cdot(y_i-\hat{y}_i)^2}\) where \(y_i\) is each observed value, \(\hat{y}_i\) is each corresponding predicted value, and \(n\) is the number of observations. NOTE: Recall that a residual is \(y_i-\hat{y}_i\) - the distance between every point and its predicted value! It’s also important to realize that the \(RMSE\) gives more weight to large errors because of squaring. Thus, it’s sensitive to outliers.

Question 27C: [1 point]

Can you interpret what the RMSE tells your client here? Why or why not? What about the \(R^2\)?

Hint 1: Remember that the outcome here is Box-Cox transformed. Would you need to undo the transformation to be able to assess what the RMSE (or \(R^2\)) is actually telling us? Recall that the Box-Cox \(\hat{y} = \frac{1}{y^\lambda}\) where our \(\lambda = -0.3\) AND we also centered and scaled these data.

Hint 2: Is it practical to undo both transformations? How does that hinder our ability to make an interpretation for our client?

We cannot clearly interpret what the RMSE tells the client, in order to make it interpretable we would have to undo the previous transformations. Right now, the RMSE only tells us how well the model fits relative to the scaled and transformated data. This is food for internal model evaluation but not necessiarly for stakeholder analysis. For R^2 as well, we can use this to compare models internally, but to present meaningful results to the client we would have to back-transform predictions to the original scale, or we would focus on relative feature important. However, for the client, it would not be very practical to undo both the Box-Cox transformation the centering and scaling for RMSE interpretation, which would be necessary otherwise. This would ultimately hinder interpretation because the transformed RMSE would not align with real-world units (such as percent readmission). This would ultimately limit the ability to make straightforward and actionable statements, especially without needing to complex and potentially risky back-transformations.

Lastly, let’s graph the effect predictive ability of our OLS model by comparing the relationship between the true values of bc_PredictedReadmissionRate from the readyTest dataset to the predictions we made using the model on the testing data.

Figure 18. Predictions Plot for OLS Regression

Figure 18. Predictions Plot for OLS Regression

NOTE that, while there is an obvious linear trend, it is weak. There is a lot of error around the predictions from the OLS regression!

6.2 The Elastic Net

Perhaps the seemingly low \(R^2\) is because we still have too much multicollinearity or overfitting. So, as we discussed in lecture, we can address this using regularized regressions, such as the Elastic Net, which is more robust to multicollinearity by assessing penalties for non-zero variance. Typically, we would likely conduct a Ridge and/or LASSO before running an Elastic Net, but since we already know that (1) we have a fairly high amount of multicollinearity in our original readyTrain dataset and (2) we know we want to be able to perform feature reduction and figure out which features are important predictors of hospital readmissions, we’re going to jump straight into Elastic Net, which performs both.

6.2.1 The Elastic Net Unpacked

In general, regularization is a technique that applies a penalty term to a cost function of a machine learning model, like the OLS regression. The reason is to discourage overfitting. This penalty term constrains the model’s coefficients, which limits how flexible they are during training. Why?

The more flexible the coefficients, the less likely they will “learn” new data, i.e., the more generalizable they are! Thus, by applying this penalty (constraint), it improves the model’s performance on “unseen” or new data.

As mentioned in lecture, Elastic Net regression is a general regularization technique that combines the regularization techniques of Ridge and LASSO to help us perform both feature selection (i.e., significant coefficients) and feature reduction (importance). The elastic net can even help us with the \(P >> n\) problem! Elastic net has two regularization terms, called L1 (or the Lasso regularization term) and L2 (the Ridge term).

  • L1: makes some of the coefficients zero, thereby selecting only the most important features (shrinks them to zero). It is often said that this encourages sparsity in the model.

  • L2: reduces the coefficients to small but non-zero values, thereby reducing the coefficients of the unimportant features (which we can use to determine significance). It does this by adding a penalty term to the cost function of the model, which is proportional to the coefficient squared. This allows us to retain all the features in the model but reduces the overall impact of the non-significant or less-important ones.

Thus, by combining these techniques together, the Elastic Net is balancing feature selection with feature reduction! If you’re interested in reading the original paper on Elastic Net by Zou and Hastie (2005), you can find a free PDF of the paper here.

6.2.2 Running, tuning, & cross-validating the Elastic Net

The elastic net can be run using the glmnet package (which we are accessing via caret), and can be cross-validated as random forest and SVM from our last project could be. This is a big improvement over OLS regression, which cannot be tuned. Additionally, the elastic net has two hyperparameters that we can tune, \(\alpha\) and \(\lambda\) (not to be confused with the \(\lambda\) from Box-Cox transformation - Greek letters are abundant in statistics! Why do you think I named our course teams as I did?!).

  • \(\alpha\) is the strength of the regularization, representing the balance between L1 and L2 regularization. Larger \(\alpha\) results in L1 regularization (feature selection / Ridge) whereas smaller \(\alpha\) results in more L2 regularization (higher shrinkage / feature reduction / Lasso). When \(\alpha=0\) the regression is equivalent to OLS regression! \(\alpha\) is also referred to as the mixing percent of L1 and L2 regularization.

  • \(\lambda\) is the shrinkage parameter, such that when \(\lambda = 0\) no shrinkage is performed and is equivalent to an OLS regression. As \(\lambda\) increases, the coefficients are increasingly shrunk and thus the more features will have coefficients shrunk to zero, regardless of the value of \(\alpha\).

6.2.2.1 We will start by setting our CV parameters just as we did on the last project, using the trainControl() function in caret.

NOTE that I am performing a 5-fold repeated cross-validation (CV) with 5 iterations.

ctrl <- trainControl(method = "repeatedcv",  ## Do repeated CV
                     number = 5,             ## Number of k-folds
                     repeats = 5,            ## Number of repeats
                     search = "grid",        ## Grid search (vs. random)
                     verboseIter = FALSE)    ## Don't show me all the results

6.2.2.2 Then, we fit the model using the train() function in caret, specifying that we want to perform an elastic net, which will tune the model and perform CV per the specifications in the trainControl(). NOTE that I am allowing the values of \(\alpha\) to range from \(0 \rightarrow 1\) and allowing \(\lambda\) to range from \(0 \rightarrow 5\).

## Create a search grid that allows alpha to range from 0 --> and 
## allows lambda to range from 0 --> 5
searchGrid <- expand.grid(.alpha = seq(0, 1, length.out = 10), 
                          .lambda = seq(0, 5, length.out = 15))

## Fit the elastic net using the tuning grid and CV scheme laid out
elasticMod <- train(bc_PredictedReadmissionRate ~ ., 
                    data = readyTrain, 
                    method = "glmnet", 
                    tuneGrid = searchGrid,
                    trControl = ctrl) 

Question 28: [1 point]

Explain what we are doing here. You may want to look up the parameters of the trainControl() and train() functions. Make sure to address what type of CV and hyperparameter search is being performed, as well as if both the \(\alpha\) and \(\lambda\) hyperparameters of elastic net are being tuned.

Hint 1: Why are we doing cross-validation, and specifically why did I choose the one I did?

Hint 2: What does \(\lambda = 0\) mean? How about \(\alpha = 0\)?

Here, we are fitting an elastic net regression model using the train() function from the caret package. This allows us to tune the two key hyperparamters of elastic net - alpha which controls the mxi between L1 (lasso) and L2 (ridge) penalties and lambda which controls how much we shrink coefficents toward zero. We also set up a grid search over a range of values using expand.grid() so the model will test combinations of the two hyperparamters in order to find the best combination. We are using repeated k-fold cross-validation because it averages eperformance over multiple random splits. This way we able to evaluate the model more robustly because we test it on multiple subsets of data, thereby reducing overfitting and giving us a better understanding on how the model will perform on unseen data. Lamba=0 means there is no shrinkage penality and alpha=0 means there is pure ridge resgression (only an L2 penalty).

6.2.3 Elastic Net Results

6.2.3.1 We are going to first take a look at the top 15 results of the Elastic Net in a table form, sorted from lowest \(RMSE\):

Table 11. Top Performing Results of the Elastic Net Tuning & 5-fold Repeated Cross-Validation
alpha lambda RMSE Rsquared MAE RMSESD RsquaredSD MAESD
0.00 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.11 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.22 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.33 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.44 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.56 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.78 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.67 0.00 0.78 0.39 0.62 0.03 0.03 0.02
1.00 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.89 0.00 0.78 0.39 0.62 0.03 0.03 0.02
0.11 0.36 0.80 0.38 0.63 0.03 0.03 0.02
0.22 0.36 0.82 0.37 0.65 0.03 0.04 0.02
0.33 0.36 0.83 0.36 0.66 0.03 0.04 0.02
0.44 0.36 0.85 0.34 0.67 0.03 0.04 0.02
0.56 0.36 0.87 0.32 0.69 0.03 0.04 0.02

NOTE: Folds that fail to converge will have an \(R^2\) = NA.

6.2.3.2 We can also extract the hyperparameters of our best tuned model:

\(\alpha\):

## [1] 0

\(\lambda\):

## [1] 0

Question 29: [1 point]

Check the values of \(\alpha\) and \(\lambda\) here. What do they suggest about the optimal regularization that is being fit here? (Recall that when both \(\alpha\) and \(\lambda\) are zero, it’s an OLS regression!)

The values of alpha=0 and lambda=0 suggest that the optimal regularilization selected by the elastic net tuning is no reguularization at all. Alpha=0 suggests. the model relies on ridge regression, but since lambda=0, it means no penalty is actually applied. This means the elastic net collapses to a simple OLS regression. Overall, this suggests that regularization did not improve predictive performance beyond what OLS would already achieve.

Let’s also plot the results of the tuning search. Can you figure out the type of tuning search you did? (Hint: the visual pattern should confirm your answer from earlier!) The diamond denotes the best combination of hyperparameters!

results <- elasticMod$results %>% 
  data.frame()
best <- results[as.numeric(rownames(elasticMod$bestTune)), ]
Figure 19. Hyperparamater Grid Search for Elastic Net

Figure 19. Hyperparamater Grid Search for Elastic Net

6.2.3.3 Important Features

Important features can also be extracted, using the varImp() function from caret.

important <- varImp(elasticMod)$importance
Figure 20. Variable Importance from Best Elastic Net Model

Figure 20. Variable Importance from Best Elastic Net Model

Question 30: [1 point]

Which of the features were most important? How does this compare to the most important features out of OLS regression? Make sure you’re looking at the absolute value of the beta coefficients to assess importance in OLS regression.

The most important features in Elastic Net are Hospital return days for pneumonia patients, medicare spending per patient, death rate for pneumonia patients, state, and venous thromboembolism prophylaxis. In contrast to OLS regression, the most important variables are Hospital return days for pneumonia patients (very lalrge negative effect), medicare spending per patient (positive), death rate for pneumonia patients (negaive), and state (positive).

How about about of PCA? You may have already noticed that importance almost seems inverted! Why?!

Hint 1: What is the fundamental difference between a PCA and any regression?

Hint 2: What does “importance” mean in a PCA vs. any regression?

Hint 3: What does the sign of the loadings mean in a PCA? Are they actual or arbitrary?

The fundamental difference between PCA and regression, is that PCA is an unsupervised learning method, it looks at the predictors and tries to explain the maximum variance among them without using the dependent variable. Regression is supervised and it explicity models the relationship between the predictors and the outcome.
In PCA, importance refers to how much a variable contributes to explaining the variance in the principal components. In regression, importance to how strongly a predictor explains or predicts the outcome. The signs of PCA loadings is technically aribtrary - it depends on the direction of the axis. In regression, the sign of the coefficent matters because it tells us the direction of the relationship between the predictor and the outcome (positive or negative).

6.2.3.2 Check model performance

Assessing model performance means assessing how well our training model does when applied to “unseen” data, here our readyTest dataset. Because we know the right answer - the true value of PredictedReadmissionRate, we can assess the error that the training model produces when fitted against the testing data. We measure that error as \(RMSE\) and \(MAE\). We briefly reviewed what the \(RMSE\) measures back in Question 36, so I will only review what the \(MAE\) is here.

Mean Absolute Error (\(MAE\)) is the average of the absolute differences between predicted and actual values, such that \(MAE = \frac{1}{n}\sum_{i=1}^n\cdot|y_i-\hat{y}_i|\) where \(y_i\) is each observed value, \(\hat{y}_i\) is each corresponding predicted value, and \(n\) is the number of observations. So, unlike \(RMSE\), \(MAE\) treats all errors equally and therefore is more robust to outliers. So, \(MAE\) will tell you the average magnitude of the error, regardless of direction!

Let’s also make predictions on the testing sets for both the OLS and Elastic Net models using the postResample() function in caret; this will not only do predictions, but it will enable us to more easily generate metrics that we can use to compare model performance, such as \(R^2\) and MAE.

## Post-resample on OLS
prOLS <- postResample(pred = predict(bestMod, newdata = readyTest), 
                      obs = readyTest$bc_PredictedReadmissionRate)

## Post-resample on Elastic Net
prElastic <- postResample(pred = predict(elasticMod, newdata = readyTest), 
                      obs = readyTest$bc_PredictedReadmissionRate)
Table 12. Comparison of Model (Test) Performance of OLS and Elastic Net Regressions
RMSE Rsquared MAE
prOLS 0.81 0.34 0.65
prElastic 0.81 0.35 0.64

6.2.3.3 Check for overfitting/underfitting

Using the same method, check for overfitting / underfitting. Make predictions using readyTrain rather than readyTest, and then check how the \(RMSE\) on the training data compares to the \(RMSE\) of the training data for the Elastic Net. How do they compare? How do both compare to what we got out of the OLS regression?

## Post-resample on OLS
prOLS <- postResample(pred = predict(bestMod, newdata = readyTrain), 
                      obs = readyTrain$bc_PredictedReadmissionRate)

## Post-resample on Elastic Net
prElastic <- postResample(pred = predict(elasticMod, newdata = readyTrain), 
                      obs = readyTrain$bc_PredictedReadmissionRate)
Table 13. Comparison of Training Performance of OLS and Elastic Net Regressions
RMSE Rsquared MAE
prOLS 0.78 0.4 0.61
prElastic 0.78 0.4 0.61

TAKEAWAY

When it comes to which performed better, it’s obvious that the Elastic Net only marginally (very marginally!) out-performed the OLS regression or the Elastic Net (Table 12). The \(R^2\) is a 0.01 higher in the Elastic Net model and both measures of error are a teensy bit lower. But, truthfully, they’re performing the same because they are the same model. The Elastic Net likely out-performed simply because it was cross-validated!

To interpret \(RMSE\) from training (Table 13) vs. testing (Table 12), we’re looking first to see if they are the same/balanced. However, we can see that error (\(RMSE\), \(MAE\)) is a bit higher in the testing post-resample vs. the training post-resample. While this could suggest that we have some mild overfitting (our model memorized the training data a little too well leading to poor generalization when it “saw” the testing data), instead I think it is more likely is that we have underfitting due to a lack of complexity! Underfitting error patterns are such that you will see roughly the same error in both the testing and training sets, and both error rates are quite high.

6.2.3.4 Graphical comparison of the two models’ predictions

Figure 19. Predictions Plot for Elastic Net vs. OLS Regression

Figure 19. Predictions Plot for Elastic Net vs. OLS Regression

It should not come as a surprise, but these models are performing VERY similarly!

Question 32: [4 points]

What conclusions do you make at this point about whether to make predictions with an OLS regression, Elastic Net, or neither? What recommendations would you make to our client at this point? What other analyses might your suggest, or other data we might want to include? Please try to flesh out 2-3 recommendations in at least a paragraph.

HINT: Make sure to ask yourself whether these data appear linear or not. Could non-linearity be a cause for a low \(R^2\)? If not, what other causes are there and how will that impact your recommendations?

At this point, we can conlude that both the OLS regression and the Elastic Net perform very similarly, with nearly identical RMSE, R^2, and MAE values on both the training and testing sets. Although the Elastic Net marginally outperforms the OLS regression, the improvement is slight - suggesting that adding regularization did not actually add much extra benefit. This tells us that neither method has meaningfully captured more predective power - either because the underlying relationships in the data are too weak/noisy/non-linear. I think we should avoid relying solely on these linear models for making predictions about pneumonia-related hospital readmissions because the predictive power is relatively low (R^2 around 0.35-0.40). This could lead to unreliable predictions in practice. We should explore non-linear models such as random forests/gradient boosting/support vecto machines. These can capture complex interactions, along with non-linear effects that linear models cannot. These models could uncover non-linear patterns. We could also consider adding additional data sources, along with more predective feartures. Important clinical, demographic, or socioeconomic variables were not included so adding them could meaningfully boost predective performance. We could review whether important interactions are currently being missed. Overall, the reccomendation is to use a more flexible machine learning approach (Random Forest) and enhance the feature set to achieve practical and reliable models for decision making. While OLS and Elastic Net gives a baseline they would be insufficent for robust predictions.

7 Choose Your Own Adventure

It’s time to choose your adventure! After working through this demo and answering the 31 questions, you will choose one of three possible trajectories. The following are some brief hints or tips for each of the choices available to you.

Adventure 1.

[Choosing one other condition to analyze.]

  • Make sure that you use the code I provided, updating it for your new condition

  • Make sure to answer all of the interpretative questions again for your new dataset.

  • I will specifically be looking for a comparison of which condition, pneumonia or the one you chose, seems to be a better choice for a predictive model for our client. Or is it neither?

  • Make sure to include some next steps or recommendations based on your new analysis!

Recommendations & Hints:

You can avoid going back to Demo 1 to do your importing, cleaning, and pre-processing by running this source code:

## Make any necessary changes in the R source code
## Uncomment to run!
# source(file = "reRunDemoData.R")
  • Make sure to update your filepath to where you stored the hospital files on YOUR local machine, just as you did in Question 1 of the Demo!

  • Make sure to update the condition that gets selected from “PN” to the condition of your choosing. A full list of available conditions is found in Demo 1.

Adventure 2.

[Choosing two or more conditions to analyze.]

  • Choose a slightly more complicated analysis to undertake, for example, focusing on surgical interventions (HIP-KNEE & CABG) or heart-related conditions (HF, AMI, & CABG).

  • See the same recommendations and hints as Adventure 1.

Adventure 3.

[Continue analyzing the pneuomnia data.]

  • Make sure to justify your choice.

  • Choose at least two new algorithms to perform. They can be unsupervised or supervised, but must be 2+.

  • A natural extension would be to do a Ridge and/or a LASSO; the former would tell you if you should be controlling for multicollinearity more than you currently are, the latter would tell you if you need to further shrink coefficients than you currently are. Additionally, you could consider a regression-based method that incorporates non-linearity, such as a random forest. However, you are free to choose anything you think is appropriate here!

  • Other choices I think you could potentially justify to stakeholders, if you were interested:

  1. Switch to classification-based random forest using ComparedToNational_Hospital return days for pneumonia patients

  2. kNN using the 2 clusters we’ve identified and named in our segmentation analysis

  3. Our data are also ready for a neural network analysis (e.g., a feed-forward NN)!

Deliverables

I am looking for 1-2 markdown documents with their knitted HTML. For example, if you’re choosing Adventures 1 or 2, you may want to work through this document once, make a copy of this markdown document, move the source code to the TOP, edit the source code as needed, and then go through this again from the beginning using the new condition(s) you chose.

Alternatively, if you’re choosing Adventure 3, maybe you just choose to add on to the bottom of this document, replacing the ‘Choose Your Own Adventure’ section with your actual adventure.

Just make sure to answer the questions well,and make sure to justify the decisions you make. Tell me WHY you’re choosing the condition(s) you are in Adventures 1 or 2, or Tell me WHY you’re doing the analyses you are in Adventure 3. Other than that, make this your own learning adventure!

References

Box, G. E. P., and D. R. Cox. 2018. “An Analysis of Transformations.” Journal of the Royal Statistical Society: Series B (Methodological) 26 (2): 211–43. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x.
De Alba, Israel, and Alpesh Amin. 2014. “Pneumonia Readmissions: Risk Factors and Implications.” Ochsner Journal 14 (4): 649–54.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.
Yeo, In‐Kwon, and Richard A. Johnson. 2000. “A New Family of Power Transformations to Improve Normality or Symmetry.” Biometrika 87 (4): 954–59. https://doi.org/10.1093/biomet/87.4.954.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society Series B: Statistical Methodology 67 (2): 301–20. https://doi.org/10.1111/j.1467-9868.2005.00503.x.